Merge pull request #6 from dmlc/master

update
yanqingmen 2015-12-18 14:24:08 +08:00
commit f378fac6a1
172 changed files with 4068 additions and 1709 deletions

.gitignore

@@ -66,3 +66,8 @@ java/xgboost4j-demo/tmp/
 java/xgboost4j-demo/model/
 nb-configuration*
 dmlc-core
+# Eclipse
+.project
+.cproject
+.pydevproject
+.settings/


@@ -37,11 +37,22 @@ xgboost-0.4
 on going at master
 ==================
-* Fix List
-  - Fixed possible problem of poisson regression for R.
-* Python module now throw exception instead of crash terminal when a parameter error happens.
-* Python module now has importance plot and tree plot functions.
+* Changes in R library
+  - fixed possible problem of poisson regression.
+  - switched from 0 to NA for missing values.
+  - exposed access to additional model parameters.
+* Changes in Python library
+  - throws exception instead of crash terminal when a parameter error happens.
+  - has importance plot and tree plot functions.
+  - accepts different learning rates for each boosting round.
+  - allows model training continuation from previously saved model.
+  - allows early stopping in CV.
+  - allows feval to return a list of tuples.
+  - allows eval_metric to handle additional format.
+  - improved compatibility in sklearn module.
+  - additional parameters added for sklearn wrapper.
+  - added pip installation functionality.
+  - supports more Pandas DataFrame dtypes.
+  - added best_ntree_limit attribute, in addition to best_score and best_iteration.
 * Java api is ready for use
-* Added more test cases and continuous integration to make each build more robust
-* Improvements in sklearn compatible module
-* Added pip installation functionality for python module
+* Added more test cases and continuous integration to make each build more robust.
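
As a minimal R sketch (not part of this commit; the dataset and parameter values are illustrative only), the "switched from 0 to NA for missing values" change listed above can be exercised like this:

    library(xgboost)
    data(agaricus.train, package = "xgboost")

    # NA is now the default missing-value indicator; pass missing = 0 explicitly
    # only if zeros in your matrix really mean "missing".
    dtrain <- xgb.DMatrix(data = agaricus.train$data,
                          label = agaricus.train$label,
                          missing = NA)
    bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2,
                   objective = "binary:logistic")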


@@ -13,6 +13,8 @@ Committers are people who have made substantial contribution to the project and
 - Bing is the original creator of xgboost python package and currently the maintainer of [XGBoost.jl](https://github.com/antinucleon/XGBoost.jl).
 * [Michael Benesty](https://github.com/pommedeterresautee)
 - Michael is a lawyer and data scientist in France; he is the creator of the xgboost interactive analysis module in R.
+* [Yuan Tang](https://github.com/terrytangyuan)
+  - Yuan is a data scientist in Chicago, US. He contributed mostly to the R and Python packages.

 Become a Committer
 -----------------
@@ -34,7 +36,6 @@ List of Contributors
 * [Zygmunt Zając](https://github.com/zygmuntz)
 - Zygmunt is the master behind the early stopping feature frequently used by kagglers.
 * [Ajinkya Kale](https://github.com/ajkl)
-* [Yuan Tang](https://github.com/terrytangyuan)
 * [Boliang Chen](https://github.com/cblsjtu)
 * [Vadim Khotilovich](https://github.com/khotilov)
 * [Yangqing Men](https://github.com/yanqingmen)
@@ -49,4 +50,10 @@ List of Contributors
 - Masaaki is the initial creator of xgboost python plotting module.
 * [Hongliang Liu](https://github.com/phunterlau)
 - Hongliang is the maintainer of xgboost python PyPI package for pip installation.
+* [daiyl0320](https://github.com/daiyl0320)
+  - daiyl0320 contributed a patch to make the xgboost distributed version more robust and scale stably on TB-scale datasets.
 * [Huayi Zhang](https://github.com/irachex)
+* [Johan Manders](https://github.com/johanmanders)
+* [yoori](https://github.com/yoori)
+* [Mathias Müller](https://github.com/far0n)
+* [Sam Thomson](https://github.com/sammthomson)


@@ -177,11 +177,11 @@ Rcheck:
 R CMD check --as-cran xgboost*.tar.gz

 pythonpack:
-#make clean
+#for pip maintainer only
 cd subtree/rabit;make clean;cd ..
 rm -rf xgboost-deploy xgboost*.tar.gz
 cp -r python-package xgboost-deploy
-cp *.md xgboost-deploy/
+#cp *.md xgboost-deploy/
 cp LICENSE xgboost-deploy/
 cp Makefile xgboost-deploy/xgboost
 cp -r wrapper xgboost-deploy/xgboost
@@ -189,7 +189,7 @@ pythonpack:
 cp -r multi-node xgboost-deploy/xgboost
 cp -r windows xgboost-deploy/xgboost
 cp -r src xgboost-deploy/xgboost
+cp python-package/setup_pip.py xgboost-deploy/setup.py

 #make python
 pythonbuild:


@@ -3,16 +3,16 @@ Type: Package
 Title: Extreme Gradient Boosting
 Version: 0.4-2
 Date: 2015-08-01
-Author: Tianqi Chen <tianqi.tchen@gmail.com>, Tong He <hetong007@gmail.com>, Michael Benesty <michael@benesty.fr>
+Author: Tianqi Chen <tianqi.tchen@gmail.com>, Tong He <hetong007@gmail.com>,
+    Michael Benesty <michael@benesty.fr>
 Maintainer: Tong He <hetong007@gmail.com>
-Description: Extreme Gradient Boosting, which is an
-    efficient implementation of gradient boosting framework.
-    This package is its R interface. The package includes efficient
-    linear model solver and tree learning algorithms. The package can automatically
-    do parallel computation on a single machine which could be more than 10 times faster
-    than existing gradient boosting packages. It supports various
-    objective functions, including regression, classification and ranking. The
-    package is made to be extensible, so that users are also allowed to define
+Description: Extreme Gradient Boosting, which is an efficient implementation
+    of gradient boosting framework. This package is its R interface. The package
+    includes efficient linear model solver and tree learning algorithms. The package
+    can automatically do parallel computation on a single machine which could be
+    more than 10 times faster than existing gradient boosting packages. It supports
+    various objective functions, including regression, classification and ranking.
+    The package is made to be extensible, so that users are also allowed to define
     their own objectives easily.
 License: Apache License (== 2.0) | file LICENSE
 URL: https://github.com/dmlc/xgboost
@@ -20,16 +20,18 @@ BugReports: https://github.com/dmlc/xgboost/issues
 VignetteBuilder: knitr
 Suggests:
     knitr,
-    ggplot2 (>= 1.0.0),
-    DiagrammeR (>= 0.6),
+    ggplot2 (>= 1.0.1),
+    DiagrammeR (>= 0.8.1),
     Ckmeans.1d.dp (>= 3.3.1),
     vcd (>= 1.3),
-    testthat
+    testthat,
+    igraph (>= 1.0.1)
 Depends:
     R (>= 2.10)
 Imports:
     Matrix (>= 1.1-0),
     methods,
-    data.table (>= 1.9.4),
+    data.table (>= 1.9.6),
     magrittr (>= 1.5),
     stringr (>= 0.6.2)
+RoxygenNote: 5.0.1


@@ -1,16 +1,19 @@
-# Generated by roxygen2 (4.1.1): do not edit by hand
+# Generated by roxygen2: do not edit by hand

 export(getinfo)
 export(setinfo)
 export(slice)
 export(xgb.DMatrix)
 export(xgb.DMatrix.save)
+export(xgb.create.features)
 export(xgb.cv)
 export(xgb.dump)
 export(xgb.importance)
 export(xgb.load)
 export(xgb.model.dt.tree)
+export(xgb.plot.deepness)
 export(xgb.plot.importance)
+export(xgb.plot.multi.trees)
 export(xgb.plot.tree)
 export(xgb.save)
 export(xgb.save.raw)
@@ -23,6 +26,7 @@ importClassesFrom(Matrix,dgCMatrix)
 importClassesFrom(Matrix,dgeMatrix)
 importFrom(Matrix,cBind)
 importFrom(Matrix,colSums)
+importFrom(Matrix,sparse.model.matrix)
 importFrom(Matrix,sparseVector)
 importFrom(data.table,":=")
 importFrom(data.table,as.data.table)
@@ -35,6 +39,7 @@ importFrom(data.table,setnames)
 importFrom(magrittr,"%>%")
 importFrom(magrittr,add)
 importFrom(magrittr,not)
+importFrom(stringr,str_detect)
 importFrom(stringr,str_extract)
 importFrom(stringr,str_extract_all)
 importFrom(stringr,str_match)


@@ -23,7 +23,6 @@ setClass('xgb.DMatrix')
 #' stopifnot(all(labels2 == 1-labels))
 #' @rdname getinfo
 #' @export
-#'
 getinfo <- function(object, ...){
 UseMethod("getinfo")
 }
@@ -54,4 +53,3 @@ }
 }
 return(ret)
 })


@@ -20,6 +20,17 @@ setClass("xgb.Booster",
 #' only valid for gbtree, but not for gblinear. set it to be value bigger
 #' than 0. It will use all trees by default.
 #' @param predleaf whether predict leaf index instead. If set to TRUE, the output will be a matrix object.
+#'
+#' @details
+#' The \code{ntreelimit} option lets the user train a model with many trees but
+#' use only the first trees for prediction, to avoid overfitting
+#' (without having to train a new model with fewer trees).
+#'
+#' The \code{predleaf} option is inspired by Section 3.1 of the paper
+#' \code{Practical Lessons from Predicting Clicks on Ads at Facebook}.
+#' The idea is to use the model as a generator of new features which capture
+#' non-linear links between the original features.
+#'
 #' @examples
 #' data(agaricus.train, package='xgboost')
 #' data(agaricus.test, package='xgboost')
@@ -29,9 +40,8 @@ setClass("xgb.Booster",
 #' eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
 #' pred <- predict(bst, test$data)
 #' @export
-#'
 setMethod("predict", signature = "xgb.Booster",
-definition = function(object, newdata, missing = NULL,
+definition = function(object, newdata, missing = NA,
 outputmargin = FALSE, ntreelimit = NULL, predleaf = FALSE) {
 if (class(object) != "xgb.Booster"){
 stop("predict: model in prediction must be of class xgb.Booster")
@@ -39,11 +49,7 @@ setMethod("predict", signature = "xgb.Booster",
 object <- xgb.Booster.check(object, saveraw = FALSE)
 }
 if (class(newdata) != "xgb.DMatrix") {
-if (is.null(missing)) {
-newdata <- xgb.DMatrix(newdata)
-} else {
-newdata <- xgb.DMatrix(newdata, missing = missing)
-}
+newdata <- xgb.DMatrix(newdata, missing = missing)
 }
 if (is.null(ntreelimit)) {
 ntreelimit <- 0
@@ -52,7 +58,7 @@ setMethod("predict", signature = "xgb.Booster",
 stop("predict: ntreelimit must be equal to or greater than 1")
 }
 }
-option = 0
+option <- 0
 if (outputmargin) {
 option <- option + 1
 }
@@ -72,4 +78,3 @@ setMethod("predict", signature = "xgb.Booster",
 }
 return(ret)
 })
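
The ntreelimit and predleaf options documented above can be combined in a short sketch (not part of the diff; data and parameter values are illustrative only):

    library(xgboost)
    data(agaricus.train, package = "xgboost")
    data(agaricus.test, package = "xgboost")
    bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
                   max.depth = 2, eta = 1, nround = 10,
                   objective = "binary:logistic")

    # Use only the first 2 of the 10 trees for prediction
    pred_2trees <- predict(bst, agaricus.test$data, ntreelimit = 2)

    # Return leaf indices instead of probabilities: one column per tree,
    # usable as categorical features (see xgb.create.features later in this commit)
    leaf_idx <- predict(bst, agaricus.test$data, predleaf = TRUE)
    dim(leaf_idx)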


@@ -13,7 +13,6 @@ setMethod("predict", signature = "xgb.Booster.handle",
 bst <- xgb.handleToBooster(object)
-ret = predict(bst, ...)
+ret <- predict(bst, ...)
 return(ret)
 })


@@ -21,7 +21,6 @@
 #' stopifnot(all(labels2 == 1-labels))
 #' @rdname setinfo
 #' @export
-#'
 setinfo <- function(object, ...){
 UseMethod("setinfo")
 }


@@ -13,7 +13,6 @@ setClass('xgb.DMatrix')
 #' dsub <- slice(dtrain, 1:3)
 #' @rdname slice
 #' @export
-#'
 slice <- function(object, ...){
 UseMethod("slice")
 }
@@ -34,8 +33,8 @@ setMethod("slice", signature = "xgb.DMatrix",
 attr_list <- attributes(object)
 nr <- xgb.numrow(object)
 len <- sapply(attr_list,length)
-ind <- which(len==nr)
-if (length(ind)>0) {
+ind <- which(len == nr)
+if (length(ind) > 0) {
 nms <- names(attr_list)[ind]
 for (i in 1:length(ind)) {
 attr(ret,nms[i]) <- attr(object,nms[i])[idxset]


@@ -1,4 +1,4 @@
 #' @importClassesFrom Matrix dgCMatrix dgeMatrix
 #' @import methods
 # depends on matrix
@@ -15,14 +15,14 @@ xgb.setinfo <- function(dmat, name, info) {
 stop("xgb.setinfo: first argument dtrain must be xgb.DMatrix")
 }
 if (name == "label") {
-if (length(info)!=xgb.numrow(dmat))
+if (length(info) != xgb.numrow(dmat))
 stop("The length of labels must equal to the number of rows in the input data")
 .Call("XGDMatrixSetInfo_R", dmat, name, as.numeric(info),
 PACKAGE = "xgboost")
 return(TRUE)
 }
 if (name == "weight") {
-if (length(info)!=xgb.numrow(dmat))
+if (length(info) != xgb.numrow(dmat))
 stop("The length of weights must equal to the number of rows in the input data")
 .Call("XGDMatrixSetInfo_R", dmat, name, as.numeric(info),
 PACKAGE = "xgboost")
@@ -36,7 +36,7 @@ xgb.setinfo <- function(dmat, name, info) {
 return(TRUE)
 }
 if (name == "group") {
-if (sum(info)!=xgb.numrow(dmat))
+if (sum(info) != xgb.numrow(dmat))
 stop("The sum of groups must equal to the number of rows in the input data")
 .Call("XGDMatrixSetInfo_R", dmat, name, as.integer(info),
 PACKAGE = "xgboost")
@@ -103,18 +103,13 @@ xgb.Booster.check <- function(bst, saveraw = TRUE)
 ## ----the following are low level iteratively function, not needed if
 ## you do not want to use them ---------------------------------------
 # get dmatrix from data, label
-xgb.get.DMatrix <- function(data, label = NULL, missing = NULL, weight = NULL) {
+xgb.get.DMatrix <- function(data, label = NULL, missing = NA, weight = NULL) {
 inClass <- class(data)
 if (inClass == "dgCMatrix" || inClass == "matrix") {
 if (is.null(label)) {
 stop("xgboost: need label when data is a matrix")
 }
-dtrain <- xgb.DMatrix(data, label = label)
-if (is.null(missing)){
-dtrain <- xgb.DMatrix(data, label = label)
-} else {
-dtrain <- xgb.DMatrix(data, label = label, missing = missing)
-}
+dtrain <- xgb.DMatrix(data, label = label, missing = missing)
 if (!is.null(weight)){
 xgb.setinfo(dtrain, "weight", weight)
 }
@@ -128,7 +123,7 @@ xgb.get.DMatrix <- function(data, label = NULL, missing = NULL, weight = NULL) {
 dtrain <- data
 } else if (inClass == "data.frame") {
 stop("xgboost only support numerical matrix input,
-use 'data.frame' to transform the data.")
+use 'data.matrix' to transform the data.")
 } else {
 stop("xgboost: Invalid input of data")
 }
@@ -147,8 +142,7 @@ xgb.iter.boost <- function(booster, dtrain, gpair) {
 if (class(dtrain) != "xgb.DMatrix") {
 stop("xgb.iter.update: second argument must be type xgb.DMatrix")
 }
-.Call("XGBoosterBoostOneIter_R", booster, dtrain, gpair$grad, gpair$hess,
-PACKAGE = "xgboost")
+.Call("XGBoosterBoostOneIter_R", booster, dtrain, gpair$grad, gpair$hess, PACKAGE = "xgboost")
 return(TRUE)
 }
@@ -164,7 +158,7 @@ xgb.iter.update <- function(booster, dtrain, iter, obj = NULL) {
 if (is.null(obj)) {
 .Call("XGBoosterUpdateOneIter_R", booster, as.integer(iter), dtrain,
 PACKAGE = "xgboost")
 } else {
 pred <- predict(booster, dtrain)
 gpair <- obj(pred, dtrain)
 succ <- xgb.iter.boost(booster, dtrain, gpair)
@@ -257,17 +251,17 @@ xgb.cv.mknfold <- function(dall, nfold, param, stratified, folds) {
 # make simple non-stratified folds
 kstep <- length(randidx) %/% nfold
 folds <- list()
-for (i in 1:(nfold-1)) {
-folds[[i]] = randidx[1:kstep]
-randidx = setdiff(randidx, folds[[i]])
+for (i in 1:(nfold - 1)) {
+folds[[i]] <- randidx[1:kstep]
+randidx <- setdiff(randidx, folds[[i]])
 }
-folds[[nfold]] = randidx
+folds[[nfold]] <- randidx
 }
 }
 ret <- list()
 for (k in 1:nfold) {
 dtest <- slice(dall, folds[[k]])
-didx = c()
+didx <- c()
 for (i in 1:nfold) {
 if (i != k) {
 didx <- append(didx, folds[[i]])
@@ -275,7 +269,7 @@ xgb.cv.mknfold <- function(dall, nfold, param, stratified, folds) {
 }
 dtrain <- slice(dall, didx)
 bst <- xgb.Booster(param, list(dtrain, dtest))
-watchlist = list(train=dtrain, test=dtest)
+watchlist <- list(train=dtrain, test=dtest)
 ret[[k]] <- list(dtrain=dtrain, booster=bst, watchlist=watchlist, index=folds[[k]])
 }
 return (ret)
@@ -316,9 +310,9 @@ xgb.createFolds <- function(y, k = 10)
 ## At most, we will use quantiles. If the sample
 ## is too small, we just do regular unstratified
 ## CV
-cuts <- floor(length(y)/k)
-if(cuts < 2) cuts <- 2
-if(cuts > 5) cuts <- 5
+cuts <- floor(length(y) / k)
+if (cuts < 2) cuts <- 2
+if (cuts > 5) cuts <- 5
 y <- cut(y,
 unique(stats::quantile(y, probs = seq(0, 1, length = cuts))),
 include.lowest = TRUE)


@@ -17,8 +17,7 @@
 #' xgb.DMatrix.save(dtrain, 'xgb.DMatrix.data')
 #' dtrain <- xgb.DMatrix('xgb.DMatrix.data')
 #' @export
-#'
-xgb.DMatrix <- function(data, info = list(), missing = 0, ...) {
+xgb.DMatrix <- function(data, info = list(), missing = NA, ...) {
 if (typeof(data) == "character") {
 handle <- .Call("XGDMatrixCreateFromFile_R", data, as.integer(FALSE),
 PACKAGE = "xgboost")


@@ -12,7 +12,6 @@
 #' xgb.DMatrix.save(dtrain, 'xgb.DMatrix.data')
 #' dtrain <- xgb.DMatrix('xgb.DMatrix.data')
 #' @export
-#'
 xgb.DMatrix.save <- function(DMatrix, fname) {
 if (typeof(fname) != "character") {
 stop("xgb.save: fname must be character")


@@ -0,0 +1,91 @@
#' Create new features from a previously learned model
#'
#' May improve the learning by adding new features to the training data based on the decision trees from a previously learned model.
#'
#' @importFrom magrittr %>%
#' @importFrom Matrix cBind
#' @importFrom Matrix sparse.model.matrix
#'
#' @param model decision tree boosting model learned on the original data
#' @param training.data original data (usually provided as a \code{dgCMatrix} matrix)
#'
#' @return \code{dgCMatrix} matrix including both the original data and the new features.
#'
#' @details
#' This function is inspired by Section 3.1 of the paper:
#'
#' \strong{Practical Lessons from Predicting Clicks on Ads at Facebook}
#'
#' \emph{(Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers,
#' Joaquin Quiñonero Candela)}
#'
#' International Workshop on Data Mining for Online Advertising (ADKDD) - August 24, 2014
#'
#' \url{https://research.facebook.com/publications/758569837499391/practical-lessons-from-predicting-clicks-on-ads-at-facebook/}.
#'
#' Extract explaining the method:
#'
#' "\emph{We found that boosted decision trees are a powerful and very
#' convenient way to implement non-linear and tuple transformations
#' of the kind we just described. We treat each individual
#' tree as a categorical feature that takes as value the
#' index of the leaf an instance ends up falling in. We use
#' 1-of-K coding of this type of features.
#'
#' For example, consider the boosted tree model in Figure 1 with 2 subtrees,
#' where the first subtree has 3 leafs and the second 2 leafs. If an
#' instance ends up in leaf 2 in the first subtree and leaf 1 in
#' second subtree, the overall input to the linear classifier will
#' be the binary vector \code{[0, 1, 0, 1, 0]}, where the first 3 entries
#' correspond to the leaves of the first subtree and last 2 to
#' those of the second subtree.
#'
#' [...]
#'
#' We can understand boosted decision tree
#' based transformation as a supervised feature encoding that
#' converts a real-valued vector into a compact binary-valued
#' vector. A traversal from root node to a leaf node represents
#' a rule on certain features.}"
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' data(agaricus.test, package='xgboost')
#' dtrain <- xgb.DMatrix(data = agaricus.train$data, label = agaricus.train$label)
#' dtest <- xgb.DMatrix(data = agaricus.test$data, label = agaricus.test$label)
#'
#' param <- list(max.depth=2, eta=1, silent=1, objective='binary:logistic')
#' nround = 4
#'
#' bst = xgb.train(params = param, data = dtrain, nrounds = nround, nthread = 2)
#'
#' # Model accuracy without new features
#' accuracy.before <- sum((predict(bst, agaricus.test$data) >= 0.5) == agaricus.test$label) / length(agaricus.test$label)
#'
#' # Convert previous features to one hot encoding
#' new.features.train <- xgb.create.features(model = bst, agaricus.train$data)
#' new.features.test <- xgb.create.features(model = bst, agaricus.test$data)
#'
#' # learning with new features
#' new.dtrain <- xgb.DMatrix(data = new.features.train, label = agaricus.train$label)
#' new.dtest <- xgb.DMatrix(data = new.features.test, label = agaricus.test$label)
#' watchlist <- list(train = new.dtrain)
#' bst <- xgb.train(params = param, data = new.dtrain, nrounds = nround, nthread = 2)
#'
#' # Model accuracy with new features
#' accuracy.after <- sum((predict(bst, new.dtest) >= 0.5) == agaricus.test$label) / length(agaricus.test$label)
#'
#' # Here the accuracy was already good and is now perfect.
#' cat(paste("The accuracy was", accuracy.before, "before adding leaf features and it is now", accuracy.after, "!\n"))
#'
#' @export
xgb.create.features <- function(model, training.data){
  pred_with_leaf <- predict(model, training.data, predleaf = TRUE)
  cols <- list()
  # pred_with_leaf has one column of leaf indices per tree in the model
  for (i in 1:ncol(pred_with_leaf)) {
    # max is not the real max but it's not important for the purpose of adding features
    leaf.id <- sort(unique(pred_with_leaf[, i]))
    cols[[i]] <- factor(x = pred_with_leaf[, i], levels = leaf.id)
  }
  cBind(training.data, sparse.model.matrix( ~ . -1, as.data.frame(cols)))
}


@@ -90,8 +90,7 @@
 #' max.depth =3, eta = 1, objective = "binary:logistic")
 #' print(history)
 #' @export
-#'
-xgb.cv <- function(params=list(), data, nrounds, nfold, label = NULL, missing = NULL,
+xgb.cv <- function(params=list(), data, nrounds, nfold, label = NULL, missing = NA,
 prediction = FALSE, showsd = TRUE, metrics=list(),
 obj = NULL, feval = NULL, stratified = TRUE, folds = NULL, verbose = T, print.every.n=1L,
 early.stop.round = NULL, maximize = NULL, ...) {
@@ -99,7 +98,7 @@ xgb.cv <- function(params=list(), data, nrounds, nfold, label = NULL, missing =
 stop("xgb.cv: first argument params must be list")
 }
 if(!is.null(folds)) {
-if(class(folds)!="list" | length(folds) < 2) {
+if(class(folds) != "list" | length(folds) < 2) {
 stop("folds must be a list with 2 or more elements that are vectors of indices for each CV-fold")
 }
 nfold <- length(folds)
@@ -107,15 +106,11 @@ xgb.cv <- function(params=list(), data, nrounds, nfold, label = NULL, missing =
 if (nfold <= 1) {
 stop("nfold must be bigger than 1")
 }
-if (is.null(missing)) {
-dtrain <- xgb.get.DMatrix(data, label)
-} else {
-dtrain <- xgb.get.DMatrix(data, label, missing)
-}
-dot.params = list(...)
-nms.params = names(params)
-nms.dot.params = names(dot.params)
-if (length(intersect(nms.params,nms.dot.params))>0)
+dtrain <- xgb.get.DMatrix(data, label, missing)
+dot.params <- list(...)
+nms.params <- names(params)
+nms.dot.params <- names(dot.params)
+if (length(intersect(nms.params,nms.dot.params)) > 0)
 stop("Duplicated defined term in parameters. Please check your list of params.")
 params <- append(params, dot.params)
 params <- append(params, list(silent=1))
@@ -127,16 +122,16 @@ xgb.cv <- function(params=list(), data, nrounds, nfold, label = NULL, missing =
 if (!is.null(params$objective) && !is.null(obj))
 stop("xgb.cv: cannot assign two different objectives")
 if (!is.null(params$objective))
-if (class(params$objective)=='function') {
-obj = params$objective
-params[['objective']] = NULL
+if (class(params$objective) == 'function') {
+obj <- params$objective
+params[['objective']] <- NULL
 }
 # if (!is.null(params$eval_metric) && !is.null(feval))
 # stop("xgb.cv: cannot assign two different evaluation metrics")
 if (!is.null(params$eval_metric))
-if (class(params$eval_metric)=='function') {
-feval = params$eval_metric
-params[['eval_metric']] = NULL
+if (class(params$eval_metric) == 'function') {
+feval <- params$eval_metric
+params[['eval_metric']] <- NULL
 }
 # Early Stopping
@@ -148,39 +143,39 @@ xgb.cv <- function(params=list(), data, nrounds, nfold, label = NULL, missing =
 if (is.null(maximize))
 {
 if (params$eval_metric %in% c('rmse','logloss','error','merror','mlogloss')) {
-maximize = FALSE
+maximize <- FALSE
 } else {
-maximize = TRUE
+maximize <- TRUE
 }
 }
 if (maximize) {
-bestScore = 0
+bestScore <- 0
 } else {
-bestScore = Inf
+bestScore <- Inf
 }
-bestInd = 0
-earlyStopflag = FALSE
-if (length(metrics)>1)
+bestInd <- 0
+earlyStopflag <- FALSE
+if (length(metrics) > 1)
 warning('Only the first metric is used for early stopping process.')
 }
 xgb_folds <- xgb.cv.mknfold(dtrain, nfold, params, stratified, folds)
-obj_type = params[['objective']]
-mat_pred = FALSE
-if (!is.null(obj_type) && obj_type=='multi:softprob')
+obj_type <- params[['objective']]
+mat_pred <- FALSE
+if (!is.null(obj_type) && obj_type == 'multi:softprob')
 {
-num_class = params[['num_class']]
+num_class <- params[['num_class']]
 if (is.null(num_class))
 stop('must set num_class to use softmax')
 predictValues <- matrix(0,xgb.numrow(dtrain),num_class)
-mat_pred = TRUE
+mat_pred <- TRUE
 }
 else
 predictValues <- rep(0,xgb.numrow(dtrain))
 history <- c()
-print.every.n = max(as.integer(print.every.n), 1L)
+print.every.n <- max(as.integer(print.every.n), 1L)
 for (i in 1:nrounds) {
 msg <- list()
 for (k in 1:nfold) {
@@ -191,46 +186,44 @@ xgb.cv <- function(params=list(), data, nrounds, nfold, label = NULL, missing =
 ret <- xgb.cv.aggcv(msg, showsd)
 history <- c(history, ret)
 if(verbose)
-if (0==(i-1L)%%print.every.n)
+if (0 == (i - 1L) %% print.every.n)
 cat(ret, "\n", sep="")
 # early_Stopping
 if (!is.null(early.stop.round)){
-score = strsplit(ret,'\\s+')[[1]][1+length(metrics)+2]
-score = strsplit(score,'\\+|:')[[1]][[2]]
-score = as.numeric(score)
-if ((maximize && score>bestScore) || (!maximize && score<bestScore)) {
-bestScore = score
-bestInd = i
+score <- strsplit(ret,'\\s+')[[1]][1 + length(metrics) + 2]
+score <- strsplit(score,'\\+|:')[[1]][[2]]
+score <- as.numeric(score)
+if ( (maximize && score > bestScore) || (!maximize && score < bestScore)) {
+bestScore <- score
+bestInd <- i
 } else {
-if (i-bestInd>=early.stop.round) {
-earlyStopflag = TRUE
+if (i - bestInd >= early.stop.round) {
+earlyStopflag <- TRUE
 cat('Stopping. Best iteration:',bestInd)
 break
 }
 }
 }
 }
 if (prediction) {
 for (k in 1:nfold) {
-fd = xgb_folds[[k]]
+fd <- xgb_folds[[k]]
 if (!is.null(early.stop.round) && earlyStopflag) {
-res = xgb.iter.eval(fd$booster, fd$watchlist, bestInd - 1, feval, prediction)
+res <- xgb.iter.eval(fd$booster, fd$watchlist, bestInd - 1, feval, prediction)
 } else {
-res = xgb.iter.eval(fd$booster, fd$watchlist, nrounds - 1, feval, prediction)
+res <- xgb.iter.eval(fd$booster, fd$watchlist, nrounds - 1, feval, prediction)
 }
 if (mat_pred) {
-pred_mat = matrix(res[[2]],num_class,length(fd$index))
-predictValues[fd$index,] = t(pred_mat)
+pred_mat <- matrix(res[[2]],num_class,length(fd$index))
+predictValues[fd$index,] <- t(pred_mat)
 } else {
-predictValues[fd$index] = res[[2]]
+predictValues[fd$index] <- res[[2]]
 }
 }
 }
 colnames <- str_split(string = history[1], pattern = "\t")[[1]] %>% .[2:length(.)] %>% str_extract(".*:") %>% str_replace(":","") %>% str_replace("-", ".")
 colnamesMean <- paste(colnames, "mean")
 if(showsd) colnamesStd <- paste(colnames, "std")
@@ -243,10 +236,10 @@ xgb.cv <- function(params=list(), data, nrounds, nfold, label = NULL, missing =
 dt <- utils::read.table(text = "", colClasses = type, col.names = colnames) %>% as.data.table
 split <- str_split(string = history, pattern = "\t")
-for(line in split) dt <- line[2:length(line)] %>% str_extract_all(pattern = "\\d*\\.+\\d*") %>% unlist %>% as.numeric %>% as.list %>% {rbindlist(list(dt, .), use.names = F, fill = F)}
+for(line in split) dt <- line[2:length(line)] %>% str_extract_all(pattern = "\\d*\\.+\\d*") %>% unlist %>% as.numeric %>% as.list %>% {rbindlist( list( dt, .), use.names = F, fill = F)}
 if (prediction) {
-return(list(dt = dt,pred = predictValues))
+return( list( dt = dt,pred = predictValues))
 }
 return(dt)
 }
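
A hedged usage sketch for the early-stopping support added to xgb.cv above (not part of the diff; data, metric and parameter values are illustrative only):

    library(xgboost)
    data(agaricus.train, package = "xgboost")

    cv <- xgb.cv(params = list(objective = "binary:logistic",
                               eval_metric = "error",
                               max.depth = 2, eta = 1),
                 data = agaricus.train$data, label = agaricus.train$label,
                 nrounds = 50, nfold = 5,
                 early.stop.round = 3,  # stop when no improvement for 3 rounds
                 maximize = FALSE)      # classification error is minimized
    print(cv)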


@@ -36,7 +36,6 @@
 #' # print the model without saving it to a file
 #' print(xgb.dump(bst))
 #' @export
-#'
 xgb.dump <- function(model = NULL, fname = NULL, fmap = "", with.stats=FALSE) {
 if (class(model) != "xgb.Booster") {
 stop("model: argument must be type xgb.Booster")


@@ -1,7 +1,6 @@
 #' Show importance of features in a model
 #'
-#' Read a xgboost model text dump.
-#' Can be tree or linear model (text dump of linear model are only supported in dev version of \code{Xgboost} for now).
+#' Create a \code{data.table} of the most important features of a model.
 #'
 #' @importFrom data.table data.table
 #' @importFrom data.table setnames
@@ -11,34 +10,30 @@
 #' @importFrom Matrix cBind
 #' @importFrom Matrix sparseVector
 #'
-#' @param feature_names names of each feature as a character vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.
-#'
-#' @param filename_dump the path to the text file storing the model. Model dump must include the gain per feature and per tree (\code{with.stats = T} in function \code{xgb.dump}).
-#'
-#' @param model generated by the \code{xgb.train} function. Avoid the creation of a dump file.
-#'
+#' @param feature_names names of each feature as a \code{character} vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.
+#' @param model generated by the \code{xgb.train} function.
 #' @param data the dataset used for the training step. Will be used with \code{label} parameter for co-occurence computation. More information in \code{Detail} part. This parameter is optional.
-#'
 #' @param label the label vetor used for the training step. Will be used with \code{data} parameter for co-occurence computation. More information in \code{Detail} part. This parameter is optional.
-#'
 #' @param target a function which returns \code{TRUE} or \code{1} when an observation should be count as a co-occurence and \code{FALSE} or \code{0} otherwise. Default function is provided for computing co-occurences in a binary classification. The \code{target} function should have only one parameter. This parameter will be used to provide each important feature vector after having applied the split condition, therefore these vector will be only made of 0 and 1 only, whatever was the information before. More information in \code{Detail} part. This parameter is optional.
 #'
 #' @return A \code{data.table} of the features used in the model with their average gain (and their weight for boosted tree model) in the model.
 #'
 #' @details
-#' This is the function to understand the model trained (and through your model, your data).
-#'
-#' Results are returned for both linear and tree models.
+#' This function is for both linear and tree models.
 #'
 #' \code{data.table} is returned by the function.
-#' There are 3 columns :
+#' The columns are :
 #' \itemize{
-#' \item \code{Features} name of the features as provided in \code{feature_names} or already present in the model dump.
-#' \item \code{Gain} contribution of each feature to the model. For boosted tree model, each gain of each feature of each tree is taken into account, then average per feature to give a vision of the entire model. Highest percentage means important feature to predict the \code{label} used for the training ;
-#' \item \code{Cover} metric of the number of observation related to this feature (only available for tree models) ;
-#' \item \code{Weight} percentage representing the relative number of times a feature have been taken into trees. \code{Gain} should be prefered to search the most important feature. For boosted linear model, this column has no meaning.
+#' \item \code{Features} name of the features as provided in \code{feature_names} or already present in the model dump;
+#' \item \code{Gain} contribution of each feature to the model. For boosted tree model, each gain of each feature of each tree is taken into account, then average per feature to give a vision of the entire model. Highest percentage means important feature to predict the \code{label} used for the training (only available for tree models);
+#' \item \code{Cover} metric of the number of observation related to this feature (only available for tree models);
+#' \item \code{Weight} percentage representing the relative number of times a feature have been taken into trees.
 #' }
 #'
-#' If you don't provide \code{feature_names}, index of the features will be used instead.
-#'
-#' Because the index is extracted from the model dump (made on the C++ side), it starts at 0 (usual in C++) instead of 1 (usual in R).
-#'
 #' Co-occurence count
 #' ------------------
 #'
@@ -51,35 +46,26 @@
 #' @examples
 #' data(agaricus.train, package='xgboost')
 #'
-#' # Both dataset are list with two items, a sparse matrix and labels
-#' # (labels = outcome column which will be learned).
-#' # Each column of the sparse Matrix is a feature in one hot encoding format.
-#' train <- agaricus.train
-#'
-#' bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
+#' bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 2,
 #' eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
 #'
-#' # train$data@@Dimnames[[2]] represents the column names of the sparse matrix.
-#' xgb.importance(train$data@@Dimnames[[2]], model = bst)
+#' # agaricus.train$data@@Dimnames[[2]] represents the column names of the sparse matrix.
+#' xgb.importance(agaricus.train$data@@Dimnames[[2]], model = bst)
 #'
 #' # Same thing with co-occurence computation this time
-#' xgb.importance(train$data@@Dimnames[[2]], model = bst, data = train$data, label = train$label)
+#' xgb.importance(agaricus.train$data@@Dimnames[[2]], model = bst, data = agaricus.train$data, label = agaricus.train$label)
 #'
 #' @export
-xgb.importance <- function(feature_names = NULL, filename_dump = NULL, model = NULL, data = NULL, label = NULL, target = function(x) ((x + label) == 2)){
+xgb.importance <- function(feature_names = NULL, model = NULL, data = NULL, label = NULL, target = function(x) ( (x + label) == 2)){
 if (!class(feature_names) %in% c("character", "NULL")) {
-stop("feature_names: Has to be a vector of character or NULL if the model dump already contains feature name. Look at this function documentation to see where to get feature names.")
+stop("feature_names: Has to be a vector of character or NULL if the model already contains feature name. Look at this function documentation to see where to get feature names.")
 }
-if (!(class(filename_dump) %in% c("character", "NULL") && length(filename_dump) <= 1)) {
-stop("filename_dump: Has to be a path to the model dump file.")
-}
-if (!class(model) %in% c("xgb.Booster", "NULL")) {
+if (class(model) != "xgb.Booster") {
 stop("model: Has to be an object of class xgb.Booster model generaged by the xgb.train function.")
 }
-if((is.null(data) & !is.null(label)) |(!is.null(data) & is.null(label))) {
+if((is.null(data) & !is.null(label)) | (!is.null(data) & is.null(label))) {
 stop("data/label: Provide the two arguments if you want co-occurence computation or none of them if you are not interested but not one of them only.")
 }
@@ -87,17 +73,24 @@ xgb.importance <- function(feature_names = NULL, filename_dump = NULL, model = N
 if(sum(label == 0) / length(label) > 0.5) label <- as(label, "sparseVector")
 }
-if(is.null(model)){
-text <- readLines(filename_dump)
-} else {
-text <- xgb.dump(model = model, with.stats = T)
-}
-if(text[2] == "bias:"){
-result <- readLines(filename_dump) %>% linearDump(feature_names, .)
+treeDump <- function(feature_names, text, keepDetail){
+if(keepDetail) groupBy <- c("Feature", "Split", "MissingNo") else groupBy <- "Feature"
+xgb.model.dt.tree(feature_names = feature_names, text = text)[,"MissingNo" := Missing == No ][Feature != "Leaf",.(Gain = sum(Quality), Cover = sum(Cover), Frequency = .N), by = groupBy, with = T][,`:=`(Gain = Gain / sum(Gain), Cover = Cover / sum(Cover), Frequency = Frequency / sum(Frequency))][order(Gain, decreasing = T)]
+}
+linearDump <- function(feature_names, text){
+weights <- which(text == "weight:") %>% {a =. + 1; text[a:length(text)]} %>% as.numeric
+if(is.null(feature_names)) feature_names <- seq(to = length(weights))
+data.table(Feature = feature_names, Weight = weights)
+}
+model.text.dump <- xgb.dump(model = model, with.stats = T)
+if(model.text.dump[2] == "bias:"){
+result <- model.text.dump %>% linearDump(feature_names, .)
 if(!is.null(data) | !is.null(label)) warning("data/label: these parameters should only be provided with decision tree based models.")
 } else {
-result <- treeDump(feature_names, text = text, keepDetail = !is.null(data))
+result <- treeDump(feature_names, text = model.text.dump, keepDetail = !is.null(data))
 # Co-occurence computation
 if(!is.null(data) & !is.null(label) & nrow(result) > 0) {
@@ -110,24 +103,12 @@ xgb.importance <- function(feature_names = NULL, filename_dump = NULL, model = N
 d <- data[, result[,Feature], drop=FALSE] < as.numeric(result[,Split])
 apply(c & d, 2, . %>% target %>% sum) -> vec
-result <- result[, "RealCover":= as.numeric(vec), with = F][, "RealCover %" := RealCover / sum(label)][,MissingNo:=NULL]
+result <- result[, "RealCover" := as.numeric(vec), with = F][, "RealCover %" := RealCover / sum(label)][,MissingNo := NULL]
 }
 }
 result
 }
-treeDump <- function(feature_names, text, keepDetail){
-if(keepDetail) groupBy <- c("Feature", "Split", "MissingNo") else groupBy <- "Feature"
-result <- xgb.model.dt.tree(feature_names = feature_names, text = text)[,"MissingNo":= Missing == No ][Feature!="Leaf",.(Gain = sum(Quality), Cover = sum(Cover), Frequence = .N), by = groupBy, with = T][,`:=`(Gain = Gain/sum(Gain), Cover = Cover/sum(Cover), Frequence = Frequence/sum(Frequence))][order(Gain, decreasing = T)]
-result
-}
-linearDump <- function(feature_names, text){
-which(text == "weight:") %>% {a=.+1;text[a:length(text)]} %>% as.numeric %>% data.table(Feature = feature_names, Weight = .)
-}
 # Avoid error messages during CRAN check.
 # The reason is that these variables are never declared
 # They are mainly column names inferred by Data.table...
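
A short sketch (not part of the diff) of the model-only interface described above; values are illustrative, and xgb.plot.importance() additionally needs the Ckmeans.1d.dp package suggested in DESCRIPTION:

    library(xgboost)
    data(agaricus.train, package = "xgboost")
    bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
                   max.depth = 2, eta = 1, nround = 2,
                   objective = "binary:logistic")

    imp <- xgb.importance(feature_names = agaricus.train$data@Dimnames[[2]],
                          model = bst)
    head(imp)                # Feature, Gain, Cover, Frequency
    xgb.plot.importance(imp)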


@@ -15,7 +15,6 @@
 #' bst <- xgb.load('xgb.model')
 #' pred <- predict(bst, test$data)
 #' @export
-#'
 xgb.load <- function(modelfile) {
 if (is.null(modelfile))
 stop("xgb.load: modelfile cannot be NULL")


@ -1,6 +1,6 @@
#' Convert tree model dump to data.table #' Parse boosted tree model text dump
#' #'
#' Read a tree model text dump and return a data.table. #' Parse a boosted tree model text dump and return a \code{data.table}.
#' #'
#' @importFrom data.table data.table #' @importFrom data.table data.table
#' @importFrom data.table set #' @importFrom data.table set
@ -12,20 +12,20 @@
#' @importFrom magrittr add #' @importFrom magrittr add
#' @importFrom stringr str_extract #' @importFrom stringr str_extract
#' @importFrom stringr str_split #' @importFrom stringr str_split
#' @importFrom stringr str_extract
#' @importFrom stringr str_trim #' @importFrom stringr str_trim
#' @param feature_names names of each feature as a character vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}. #' @param feature_names names of each feature as a character vector. Can be extracted from a sparse matrix (see example). If the model already contains feature names, this argument should be \code{NULL} (default value).
#' @param filename_dump the path to the text file storing the model. Model dump must include the gain per feature and per tree (parameter \code{with.stats = T} in function \code{xgb.dump}). #' @param model object created by the \code{xgb.train} function.
#' @param model dump generated by the \code{xgb.train} function. Avoid the creation of a dump file. #' @param text \code{character} vector generated by the \code{xgb.dump} function. Model dump must include the gain per feature and per tree (parameter \code{with.stats = TRUE} in function \code{xgb.dump}).
#' @param text dump generated by the \code{xgb.dump} function. Avoid the creation of a dump file. Model dump must include the gain per feature and per tree (parameter \code{with.stats = T} in function \code{xgb.dump}). #' @param n_first_tree limit the plot to the \code{n} first trees. If set to \code{NULL}, all trees of the model are plotted. Performance can be low depending of the size of the model.
#' @param n_first_tree limit the plot to the n first trees. If \code{NULL}, all trees of the model are plotted. Performance can be low for huge models.
#' #'
#' @return A \code{data.table} of the features used in the model with their gain, cover and few other thing. #' @return A \code{data.table} of the features used in the model with their gain, cover and few other information.
#' #'
#' @details #' @details
#' General function to convert a text dump of tree model to a Matrix. The purpose is to help user to explore the model and get a better understanding of it. #' General function to convert a text dump of tree model to a \code{data.table}.
#' #'
#' The content of the \code{data.table} is organised that way: #' The purpose is to help user to explore the model and get a better understanding of it.
#'
#' The columns of the \code{data.table} are:
#' #'
#' \itemize{ #' \itemize{
#' \item \code{ID}: unique identifier of a node ; #' \item \code{ID}: unique identifier of a node ;
@ -37,56 +37,40 @@
#' \item \code{Quality}: it's the gain related to the split in this specific node ; #' \item \code{Quality}: it's the gain related to the split in this specific node ;
#' \item \code{Cover}: metric to measure the number of observation affected by the split ; #' \item \code{Cover}: metric to measure the number of observation affected by the split ;
#' \item \code{Tree}: ID of the tree. It is included in the main ID ; #' \item \code{Tree}: ID of the tree. It is included in the main ID ;
#' \item \code{Yes.X} or \code{No.X}: data related to the pointer in \code{Yes} or \code{No} column ; #' \item \code{Yes.Feature}, \code{No.Feature}, \code{Yes.Cover}, \code{No.Cover}, \code{Yes.Quality} and \code{No.Quality}: data related to the pointer in \code{Yes} or \code{No} column ;
#' } #' }
#' #'
#' @examples #' @examples
#' data(agaricus.train, package='xgboost') #' data(agaricus.train, package='xgboost')
#' #'
#' #Both dataset are list with two items, a sparse matrix and labels #' bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 2,
#' #(labels = outcome column which will be learned).
#' #Each column of the sparse Matrix is a feature in one hot encoding format.
#' train <- agaricus.train
#'
#' bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
#' eta = 1, nthread = 2, nround = 2,objective = "binary:logistic") #' eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
#' #'
#' #agaricus.test$data@@Dimnames[[2]] represents the column names of the sparse matrix. #' # agaricus.train$data@@Dimnames[[2]] represents the column names of the sparse matrix.
#' xgb.model.dt.tree(agaricus.train$data@@Dimnames[[2]], model = bst) #' xgb.model.dt.tree(feature_names = agaricus.train$data@@Dimnames[[2]], model = bst)
#' #'
#' @export #' @export
xgb.model.dt.tree <- function(feature_names = NULL, filename_dump = NULL, model = NULL, text = NULL, n_first_tree = NULL){ xgb.model.dt.tree <- function(feature_names = NULL, model = NULL, text = NULL, n_first_tree = NULL){
if (!class(feature_names) %in% c("character", "NULL")) { if (!class(feature_names) %in% c("character", "NULL")) {
stop("feature_names: Has to be a vector of character or NULL if the model dump already contains feature name. Look at this function documentation to see where to get feature names.") stop("feature_names: Has to be a vector of character or NULL if the model dump already contains feature name. Look at this function documentation to see where to get feature names.")
} }
if (!(class(filename_dump) %in% c("character", "NULL") && length(filename_dump) <= 1)) {
stop("filename_dump: Has to be a character vector of size 1 representing the path to the model dump file.")
} else if (!is.null(filename_dump) && !file.exists(filename_dump)) {
stop("filename_dump: path to the model doesn't exist.")
} else if(is.null(filename_dump) && is.null(model) && is.null(text)){
stop("filename_dump & model & text: no path to dump model, no model, no text dump, have been provided.")
}
if (!class(model) %in% c("xgb.Booster", "NULL")) { if (class(model) != "xgb.Booster" & class(text) != "character") {
stop("model: Has to be an object of class xgb.Booster model generaged by the xgb.train function.") "model: Has to be an object of class xgb.Booster model generaged by the xgb.train function.\n" %>%
} paste0("text: Has to be a vector of character or NULL if a path to the model dump has already been provided.") %>%
stop()
if (!class(text) %in% c("character", "NULL")) {
stop("text: Has to be a vector of character or NULL if a path to the model dump has already been provided.")
} }
if (!class(n_first_tree) %in% c("numeric", "NULL") | length(n_first_tree) > 1) { if (!class(n_first_tree) %in% c("numeric", "NULL") | length(n_first_tree) > 1) {
stop("n_first_tree: Has to be a numeric vector of size 1.") stop("n_first_tree: Has to be a numeric vector of size 1.")
} }
if(!is.null(model)){ if(is.null(text)){
text = xgb.dump(model = model, with.stats = T) text <- xgb.dump(model = model, with.stats = T)
} else if(!is.null(filename_dump)){
text <- readLines(filename_dump) %>% str_trim(side = "both")
} }
position <- str_match(text, "booster") %>% is.na %>% not %>% which %>% c(length(text)+1) position <- str_match(text, "booster") %>% is.na %>% not %>% which %>% c(length(text) + 1)
extract <- function(x, pattern) str_extract(x, pattern) %>% str_split("=") %>% lapply(function(x) x[2] %>% as.numeric) %>% unlist extract <- function(x, pattern) str_extract(x, pattern) %>% str_split("=") %>% lapply(function(x) x[2] %>% as.numeric) %>% unlist
@ -96,15 +80,15 @@ xgb.model.dt.tree <- function(feature_names = NULL, filename_dump = NULL, model
allTrees <- data.table() allTrees <- data.table()
anynumber_regex<-"[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?" anynumber_regex <- "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?"
for(i in 1:n_round){ for (i in 1:n_round){
tree <- text[(position[i]+1):(position[i+1]-1)] tree <- text[(position[i] + 1):(position[i + 1] - 1)]
# avoid tree made of a leaf only (no split) # avoid tree made of a leaf only (no split)
if(length(tree) <2) next if(length(tree) < 2) next
treeID <- i-1 treeID <- i - 1
notLeaf <- str_match(tree, "leaf") %>% is.na notLeaf <- str_match(tree, "leaf") %>% is.na
leaf <- notLeaf %>% not %>% tree[.] leaf <- notLeaf %>% not %>% tree[.]
@ -128,7 +112,7 @@ xgb.model.dt.tree <- function(feature_names = NULL, filename_dump = NULL, model
qualityLeaf <- extract(leaf, paste0("leaf=",anynumber_regex)) qualityLeaf <- extract(leaf, paste0("leaf=",anynumber_regex))
coverBranch <- extract(branch, "cover=\\d*\\.*\\d*") coverBranch <- extract(branch, "cover=\\d*\\.*\\d*")
coverLeaf <- extract(leaf, "cover=\\d*\\.*\\d*") coverLeaf <- extract(leaf, "cover=\\d*\\.*\\d*")
dt <- data.table(ID = c(idBranch, idLeaf), Feature = c(featureBranch, featureLeaf), Split = c(splitBranch, splitLeaf), Yes = c(yesBranch, yesLeaf), No = c(noBranch, noLeaf), Missing = c(missingBranch, missingLeaf), Quality = c(qualityBranch, qualityLeaf), Cover = c(coverBranch, coverLeaf))[order(ID)][,Tree:=treeID] dt <- data.table(ID = c(idBranch, idLeaf), Feature = c(featureBranch, featureLeaf), Split = c(splitBranch, splitLeaf), Yes = c(yesBranch, yesLeaf), No = c(noBranch, noLeaf), Missing = c(missingBranch, missingLeaf), Quality = c(qualityBranch, qualityLeaf), Cover = c(coverBranch, coverLeaf))[order(ID)][,Tree := treeID]
allTrees <- rbindlist(list(allTrees, dt), use.names = T, fill = F) allTrees <- rbindlist(list(allTrees, dt), use.names = T, fill = F)
} }
@ -166,4 +150,4 @@ xgb.model.dt.tree <- function(feature_names = NULL, filename_dump = NULL, model
# Avoid error messages during CRAN check. # Avoid error messages during CRAN check.
# The reason is that these variables are never declared # The reason is that these variables are never declared
# They are mainly column names inferred by Data.table... # They are mainly column names inferred by Data.table...
globalVariables(c("ID", "Tree", "Yes", ".", ".N", "Feature", "Cover", "Quality", "No", "Gain", "Frequence")) globalVariables(c("ID", "Tree", "Yes", ".", ".N", "Feature", "Cover", "Quality", "No", "Gain", "Frequency"))
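A minimal usage sketch of the updated function (not part of the diff above; the booster and parameters below are illustrative, mirroring the package examples):
library(xgboost)
library(data.table)
data(agaricus.train, package = 'xgboost')
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               max.depth = 2, eta = 1, nthread = 2, nround = 2,
               objective = "binary:logistic")
# the model argument replaces the old filename_dump path; feature names come from the sparse matrix
dt <- xgb.model.dt.tree(feature_names = agaricus.train$data@Dimnames[[2]], model = bst)
head(dt[, .(Tree, ID, Feature, Split, Quality, Cover)])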

@ -0,0 +1,160 @@
#' Plot multiple graphs at the same time
#'
#' Plot multiple graphs aligned by rows and columns.
#'
#' @importFrom data.table data.table
#' @param cols number of columns
#' @return NULL
multiplot <- function(..., cols = 1) {
plots <- list(...)
numPlots = length(plots)
layout <- matrix(seq(1, cols * ceiling(numPlots / cols)),
ncol = cols, nrow = ceiling(numPlots / cols))
if (numPlots == 1) {
print(plots[[1]])
} else {
grid::grid.newpage()
grid::pushViewport(grid::viewport(layout = grid::grid.layout(nrow(layout), ncol(layout))))
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.table(which(layout == i, arr.ind = TRUE))
print(
plots[[i]], vp = grid::viewport(
layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col
)
)
}
}
}
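# Illustration only (not from the original file): how the internal multiplot() helper lays
# plots out; mtcars and the two scatter plots below are placeholders.
library(ggplot2)
p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()
multiplot(p1, p2, cols = 2)  # the two plots are printed side by side on one page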
#' Parse the graph to extract vector of edges
#' @param element igraph object containing the path from the root to the leaf.
edge.parser <- function(element) {
edges.vector <- igraph::as_ids(element)
t <- tail(edges.vector, n = 1)
l <- length(edges.vector)
list(t,l)
}
#' Extract path from root to leaf from data.table
#' @param dt.tree data.table containing the nodes and edges of the trees
get.paths.to.leaf <- function(dt.tree) {
dt.not.leaf.edges <-
dt.tree[Feature != "Leaf",.(ID, Yes, Tree)] %>% list(dt.tree[Feature != "Leaf",.(ID, No, Tree)]) %>% rbindlist(use.names = F)
trees <- dt.tree[,unique(Tree)]
paths <- list()
for (tree in trees) {
graph <-
igraph::graph_from_data_frame(dt.not.leaf.edges[Tree == tree])
paths.tmp <-
igraph::shortest_paths(graph, from = paste0(tree, "-0"), to = dt.tree[Tree == tree &
Feature == "Leaf", c(ID)])
paths <- c(paths, paths.tmp$vpath)
}
paths
}
#' Plot model trees deepness
#'
#' Generate a graph to plot the distribution of deepness among trees.
#'
#' @importFrom data.table data.table
#' @importFrom data.table rbindlist
#' @importFrom data.table setnames
#' @importFrom data.table :=
#' @importFrom magrittr %>%
#' @param model dump generated by the \code{xgb.train} function.
#'
#' @return Two graphs showing the distribution of the model deepness.
#'
#' @details
#' Display both the number of \code{leaf} nodes and the distribution of \code{weighted observations}
#' by tree deepness level.
#'
#' The purpose of this function is to help the user set the \code{max.depth} and
#' \code{min_child_weight} parameters according to the bias / variance trade-off.
#'
#' See \link{xgb.train} for more information about these parameters.
#'
#' The graph is made of two parts:
#'
#' \itemize{
#' \item Count: number of leaves per level of deepness;
#' \item Weighted cover: normalized weighted cover per leaf (weighted number of instances).
#' }
#'
#' This function is inspired by the blog post \url{http://aysent.github.io/2015/11/08/random-forest-leaf-visualization.html}
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#'
#' bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 15,
#' eta = 1, nthread = 2, nround = 30, objective = "binary:logistic",
#' min_child_weight = 50)
#'
#' xgb.plot.deepness(model = bst)
#'
#' @export
xgb.plot.deepness <- function(model = NULL) {
if (!requireNamespace("ggplot2", quietly = TRUE)) {
stop("ggplot2 package is required for plotting the graph deepness.",
call. = FALSE)
}
if (!requireNamespace("igraph", quietly = TRUE)) {
stop("igraph package is required for plotting the graph deepness.",
call. = FALSE)
}
if (!requireNamespace("grid", quietly = TRUE)) {
stop("grid package is required for plotting the graph deepness.",
call. = FALSE)
}
if (class(model) != "xgb.Booster") {
stop("model: Has to be an object of class xgb.Booster model generated by the xgb.train function.")
}
dt.tree <- xgb.model.dt.tree(model = model)
dt.edge.elements <- data.table()
paths <- get.paths.to.leaf(dt.tree)
dt.edge.elements <-
lapply(paths, edge.parser) %>% rbindlist %>% setnames(c("last.edge", "size")) %>%
merge(dt.tree, by.x = "last.edge", by.y = "ID") %>% rbind(dt.edge.elements)
dt.edge.summuize <-
dt.edge.elements[, .(.N, Cover = sum(Cover)), size][,Cover:= Cover / sum(Cover)]
p1 <-
ggplot2::ggplot(dt.edge.summuize) + ggplot2::geom_line(ggplot2::aes(x = size, y = N, group = 1)) +
ggplot2::xlab("") + ggplot2::ylab("Count") + ggplot2::ggtitle("Model complexity") +
ggplot2::theme(
plot.title = ggplot2::element_text(lineheight = 0.9, face = "bold"),
panel.grid.major.y = ggplot2::element_blank(),
axis.ticks = ggplot2::element_blank(),
axis.text.x = ggplot2::element_blank()
)
p2 <-
ggplot2::ggplot(dt.edge.summuize) + ggplot2::geom_line(ggplot2::aes(x =size, y = Cover, group = 1)) +
ggplot2::xlab("From root to leaf path length") + ggplot2::ylab("Weighted cover")
multiplot(p1,p2,cols = 1)
}
# Avoid error messages during CRAN check.
# The reason is that these variables are never declared
# They are mainly column names inferred by Data.table...
globalVariables(
c(
"Feature", "Count", "ggplot", "aes", "geom_bar", "xlab", "ylab", "ggtitle", "theme", "element_blank", "element_text", "ID", "Yes", "No", "Tree"
)
)
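A minimal sketch of what the plot aggregates, assuming a booster `bst` trained as in the example above; the per-leaf Cover values come from the exported xgb.model.dt.tree():
library(data.table)
dt.tree <- xgb.model.dt.tree(model = bst)
leaves <- dt.tree[Feature == "Leaf", .(Tree, ID, Cover)]
head(leaves[order(-Cover)])  # leaves carrying the largest weighted cover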

@ -1,6 +1,6 @@
#' Plot feature importance bar graph #' Plot feature importance bar graph
#' #'
#' Read a data.table containing feature importance details and plot it. #' Read a data.table containing feature importance details and plot it (for both GLM and Trees).
#' #'
#' @importFrom magrittr %>% #' @importFrom magrittr %>%
#' @param importance_matrix a \code{data.table} returned by the \code{xgb.importance} function. #' @param importance_matrix a \code{data.table} returned by the \code{xgb.importance} function.
@ -10,7 +10,7 @@
#' #'
#' @details #' @details
#' The purpose of this function is to easily represent the importance of each feature of a model. #' The purpose of this function is to easily represent the importance of each feature of a model.
#' The function return a ggplot graph, therefore each of its characteristic can be overriden (to customize it). #' The function returns a ggplot graph, therefore each of its characteristic can be overriden (to customize it).
#' In particular you may want to override the title of the graph. To do so, add \code{+ ggtitle("A GRAPH NAME")} next to the value returned by this function. #' In particular you may want to override the title of the graph. To do so, add \code{+ ggtitle("A GRAPH NAME")} next to the value returned by this function.
#' #'
#' @examples #' @examples
@ -19,39 +19,61 @@
#' #Both dataset are list with two items, a sparse matrix and labels #' #Both dataset are list with two items, a sparse matrix and labels
#' #(labels = outcome column which will be learned). #' #(labels = outcome column which will be learned).
#' #Each column of the sparse Matrix is a feature in one hot encoding format. #' #Each column of the sparse Matrix is a feature in one hot encoding format.
#' train <- agaricus.train
#' #'
#' bst <- xgboost(data = train$data, label = train$label, max.depth = 2, #' bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 2,
#' eta = 1, nthread = 2, nround = 2,objective = "binary:logistic") #' eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
#' #'
#' #train$data@@Dimnames[[2]] represents the column names of the sparse matrix. #' #agaricus.train$data@@Dimnames[[2]] represents the column names of the sparse matrix.
#' importance_matrix <- xgb.importance(train$data@@Dimnames[[2]], model = bst) #' importance_matrix <- xgb.importance(agaricus.train$data@@Dimnames[[2]], model = bst)
#' xgb.plot.importance(importance_matrix) #' xgb.plot.importance(importance_matrix)
#' #'
#' @export #' @export
xgb.plot.importance <- function(importance_matrix = NULL, numberOfClusters = c(1:10)){ xgb.plot.importance <-
if (!"data.table" %in% class(importance_matrix)) { function(importance_matrix = NULL, numberOfClusters = c(1:10)) {
stop("importance_matrix: Should be a data.table.") if (!"data.table" %in% class(importance_matrix)) {
stop("importance_matrix: Should be a data.table.")
}
if (!requireNamespace("ggplot2", quietly = TRUE)) {
stop("ggplot2 package is required for plotting the importance", call. = FALSE)
}
if (!requireNamespace("Ckmeans.1d.dp", quietly = TRUE)) {
stop("Ckmeans.1d.dp package is required for plotting the importance", call. = FALSE)
}
if(isTRUE(all.equal(colnames(importance_matrix), c("Feature", "Gain", "Cover", "Frequency")))){
y.axe.name <- "Gain"
} else if(isTRUE(all.equal(colnames(importance_matrix), c("Feature", "Weight")))){
y.axe.name <- "Weight"
} else {
stop("Importance matrix is not correct (column names issue)")
}
# To avoid issues in clustering when co-occurrences are used
importance_matrix <-
importance_matrix[, .(Gain.or.Weight = sum(get(y.axe.name))), by = Feature]
clusters <-
suppressWarnings(Ckmeans.1d.dp::Ckmeans.1d.dp(importance_matrix[,Gain.or.Weight], numberOfClusters))
importance_matrix[,"Cluster":= clusters$cluster %>% as.character]
plot <-
ggplot2::ggplot(
importance_matrix, ggplot2::aes(
x = stats::reorder(Feature, Gain.or.Weight), y = Gain.or.Weight, width = 0.05
), environment = environment()
) + ggplot2::geom_bar(ggplot2::aes(fill = Cluster), stat = "identity", position =
"identity") + ggplot2::coord_flip() + ggplot2::xlab("Features") + ggplot2::ylab(y.axe.name) + ggplot2::ggtitle("Feature importance") + ggplot2::theme(
plot.title = ggplot2::element_text(lineheight = .9, face = "bold"), panel.grid.major.y = ggplot2::element_blank()
)
return(plot)
} }
if (!requireNamespace("ggplot2", quietly = TRUE)) {
stop("ggplot2 package is required for plotting the importance", call. = FALSE)
}
if (!requireNamespace("Ckmeans.1d.dp", quietly = TRUE)) {
stop("Ckmeans.1d.dp package is required for plotting the importance", call. = FALSE)
}
# To avoid issues in clustering when co-occurences are used
importance_matrix <- importance_matrix[, .(Gain = sum(Gain)), by = Feature]
clusters <- suppressWarnings(Ckmeans.1d.dp::Ckmeans.1d.dp(importance_matrix[,Gain], numberOfClusters))
importance_matrix[,"Cluster":=clusters$cluster %>% as.character]
plot <- ggplot2::ggplot(importance_matrix, ggplot2::aes(x=stats::reorder(Feature, Gain), y = Gain, width= 0.05), environment = environment())+ ggplot2::geom_bar(ggplot2::aes(fill=Cluster), stat="identity", position="identity") + ggplot2::coord_flip() + ggplot2::xlab("Features") + ggplot2::ylab("Gain") + ggplot2::ggtitle("Feature importance") + ggplot2::theme(plot.title = ggplot2::element_text(lineheight=.9, face="bold"), panel.grid.major.y = ggplot2::element_blank() )
return(plot)
}
# Avoid error messages during CRAN check. # Avoid error messages during CRAN check.
# The reason is that these variables are never declared # The reason is that these variables are never declared
# They are mainly column names inferred by Data.table... # They are mainly column names inferred by Data.table...
globalVariables(c("Feature", "Gain", "Cluster", "ggplot", "aes", "geom_bar", "coord_flip", "xlab", "ylab", "ggtitle", "theme", "element_blank", "element_text")) globalVariables(
c(
"Feature", "Gain.or.Weight", "Cluster", "ggplot", "aes", "geom_bar", "coord_flip", "xlab", "ylab", "ggtitle", "theme", "element_blank", "element_text", "Gain.or.Weight"
)
)
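A minimal sketch of the customization mentioned in the details above, assuming the importance_matrix from the example (the ggplot2 and Ckmeans.1d.dp packages must be installed):
library(ggplot2)
gg <- xgb.plot.importance(importance_matrix)
gg + ggtitle("Feature importance on the agaricus data")  # override the default title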

@ -0,0 +1,114 @@
#' Project all trees on one tree and plot it
#'
#' Visualization of the ensemble of trees as a single collective unit.
#'
#' @importFrom data.table data.table
#' @importFrom data.table rbindlist
#' @importFrom data.table setnames
#' @importFrom data.table :=
#' @importFrom magrittr %>%
#' @importFrom stringr str_detect
#' @importFrom stringr str_extract
#'
#' @param model dump generated by the \code{xgb.train} function.
#' @param feature_names names of each feature as a \code{character} vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.
#' @param features.keep number of features to keep in each position of the multi trees.
#' @param plot.width width in pixels of the graph to produce
#' @param plot.height height in pixels of the graph to produce
#'
#' @return A \code{DiagrammeR} object of the projected trees.
#'
#' @details
#'
#' This function tries to capture the complexity of a gradient boosted tree ensemble
#' in a cohesive way.
#'
#' The goal is to improve the interpretability of a model generally seen as a black box.
#' The function is dedicated to boosting applied to decision trees only.
#'
#' The purpose is to move from an ensemble of trees to a single tree only.
#'
#' It takes advantage of the fact that the shape of a binary tree is only defined by
#' its deepness (therefore in a boosting model, all trees have the same shape).
#'
#' Moreover, the trees tend to reuse the same features.
#'
#' The function projects each tree onto a single one and keeps, for each position, the
#' first \code{features.keep} features (based on the Gain per feature measure).
#'
#' This function is inspired by this blog post:
#' \url{https://wellecks.wordpress.com/2015/02/21/peering-into-the-black-box-visualizing-lambdamart/}
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#'
#' bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 15,
#' eta = 1, nthread = 2, nround = 30, objective = "binary:logistic",
#' min_child_weight = 50)
#'
#' p <- xgb.plot.multi.trees(model = bst, feature_names = agaricus.train$data@Dimnames[[2]], features.keep = 3)
#' print(p)
#'
#' @export
xgb.plot.multi.trees <- function(model, feature_names = NULL, features.keep = 5, plot.width = NULL, plot.height = NULL){
tree.matrix <- xgb.model.dt.tree(feature_names = feature_names, model = model)
# The first number of the path represents the tree; the following numbers describe the path to follow inside that tree
# root init
root.nodes <- tree.matrix[str_detect(ID, "\\d+-0"), ID]
tree.matrix[ID %in% root.nodes, abs.node.position:=root.nodes]
precedent.nodes <- root.nodes
while(tree.matrix[,sum(is.na(abs.node.position))] > 0) {
yes.row.nodes <- tree.matrix[abs.node.position %in% precedent.nodes & !is.na(Yes)]
no.row.nodes <- tree.matrix[abs.node.position %in% precedent.nodes & !is.na(No)]
yes.nodes.abs.pos <- yes.row.nodes[, abs.node.position] %>% paste0("_0")
no.nodes.abs.pos <- no.row.nodes[, abs.node.position] %>% paste0("_1")
tree.matrix[ID %in% yes.row.nodes[, Yes], abs.node.position := yes.nodes.abs.pos]
tree.matrix[ID %in% no.row.nodes[, No], abs.node.position := no.nodes.abs.pos]
precedent.nodes <- c(yes.nodes.abs.pos, no.nodes.abs.pos)
}
tree.matrix[!is.na(Yes),Yes:= paste0(abs.node.position, "_0")]
tree.matrix[!is.na(No),No:= paste0(abs.node.position, "_1")]
remove.tree <- . %>% str_replace(pattern = "^\\d+-", replacement = "")
tree.matrix[,`:=`(abs.node.position=remove.tree(abs.node.position), Yes=remove.tree(Yes), No=remove.tree(No))]
nodes.dt <- tree.matrix[,.(Quality = sum(Quality)),by = .(abs.node.position, Feature)][,.(Text =paste0(Feature[1:min(length(Feature), features.keep)], " (", Quality[1:min(length(Quality), features.keep)], ")") %>% paste0(collapse = "\n")), by=abs.node.position]
edges.dt <- tree.matrix[Feature != "Leaf",.(abs.node.position, Yes)] %>% list(tree.matrix[Feature != "Leaf",.(abs.node.position, No)]) %>% rbindlist() %>% setnames(c("From", "To")) %>% .[,.N,.(From, To)] %>% .[,N:=NULL]
nodes <- DiagrammeR::create_nodes(nodes = nodes.dt[,abs.node.position],
label = nodes.dt[,Text],
style = "filled",
color = "DimGray",
fillcolor= "Beige",
shape = "oval",
fontname = "Helvetica"
)
edges <- DiagrammeR::create_edges(from = edges.dt[,From],
to = edges.dt[,To],
color = "DimGray",
arrowsize = "1.5",
arrowhead = "vee",
fontname = "Helvetica",
rel = "leading_to")
graph <- DiagrammeR::create_graph(nodes_df = nodes,
edges_df = edges,
graph_attrs = "rankdir = LR")
DiagrammeR::render_graph(graph, width = plot.width, height = plot.height)
}
globalVariables(
c(
"Feature", "no.nodes.abs.pos", "ID", "Yes", "No", "Tree", "yes.nodes.abs.pos", "abs.node.position"
)
)

@ -1,27 +1,15 @@
#' Plot a boosted tree model #' Plot a boosted tree model
#' #'
#' Read a tree model text dump. #' Read a tree model text dump and plot the model.
#' Plotting only works for boosted tree model (not linear model).
#' #'
#' @importFrom data.table data.table #' @importFrom data.table data.table
#' @importFrom data.table set
#' @importFrom data.table rbindlist
#' @importFrom data.table := #' @importFrom data.table :=
#' @importFrom data.table copy
#' @importFrom magrittr %>% #' @importFrom magrittr %>%
#' @importFrom magrittr not #' @param feature_names names of each feature as a \code{character} vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.
#' @importFrom magrittr add
#' @importFrom stringr str_extract
#' @importFrom stringr str_split
#' @importFrom stringr str_extract
#' @importFrom stringr str_trim
#' @param feature_names names of each feature as a character vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.
#' @param filename_dump the path to the text file storing the model. Model dump must include the gain per feature and per tree (parameter \code{with.stats = T} in function \code{xgb.dump}). Possible to provide a model directly (see \code{model} argument).
#' @param model generated by the \code{xgb.train} function. Avoid the creation of a dump file. #' @param model generated by the \code{xgb.train} function. Avoid the creation of a dump file.
#' @param n_first_tree limit the plot to the n first trees. If \code{NULL}, all trees of the model are plotted. Performance can be low for huge models. #' @param n_first_tree limit the plot to the n first trees. If \code{NULL}, all trees of the model are plotted. Performance can be low for huge models.
#' @param CSSstyle a \code{character} vector storing a css style to customize the appearance of nodes. Look at the \href{https://github.com/knsv/mermaid/wiki}{Mermaid wiki} for more information. #' @param plot.width the width of the diagram in pixels.
#' @param width the width of the diagram in pixels. #' @param plot.height the height of the diagram in pixels.
#' @param height the height of the diagram in pixels.
#' #'
#' @return A \code{DiagrammeR} of the model. #' @return A \code{DiagrammeR} of the model.
#' #'
@ -30,37 +18,26 @@
#' The content of each node is organised that way: #' The content of each node is organised that way:
#' #'
#' \itemize{ #' \itemize{
#' \item \code{feature} value ; #' \item \code{feature} value;
#' \item \code{cover}: the sum of second order gradient of training data classified to the leaf, if it is square loss, this simply corresponds to the number of instances in that branch. Deeper in the tree a node is, lower this metric will be ; #' \item \code{cover}: the sum of second order gradient of training data classified to the leaf, if it is square loss, this simply corresponds to the number of instances in that branch. Deeper in the tree a node is, lower this metric will be;
#' \item \code{gain}: metric the importance of the node in the model. #' \item \code{gain}: metric the importance of the node in the model.
#' } #' }
#' #'
#' Each branch finishes with a leaf. For each leaf, only the \code{cover} is indicated. #' The function uses \href{http://www.graphviz.org/}{GraphViz} library for that purpose.
#' It uses \href{https://github.com/knsv/mermaid/}{Mermaid} library for that purpose.
#' #'
#' @examples #' @examples
#' data(agaricus.train, package='xgboost') #' data(agaricus.train, package='xgboost')
#' #'
#' #Both dataset are list with two items, a sparse matrix and labels #' bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 2,
#' #(labels = outcome column which will be learned).
#' #Each column of the sparse Matrix is a feature in one hot encoding format.
#' train <- agaricus.train
#'
#' bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
#' eta = 1, nthread = 2, nround = 2,objective = "binary:logistic") #' eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
#' #'
#' #agaricus.test$data@@Dimnames[[2]] represents the column names of the sparse matrix. #' # agaricus.train$data@@Dimnames[[2]] represents the column names of the sparse matrix.
#' xgb.plot.tree(agaricus.train$data@@Dimnames[[2]], model = bst) #' xgb.plot.tree(feature_names = agaricus.train$data@@Dimnames[[2]], model = bst)
#' #'
#' @export #' @export
#' xgb.plot.tree <- function(feature_names = NULL, model = NULL, n_first_tree = NULL, plot.width = NULL, plot.height = NULL){
xgb.plot.tree <- function(feature_names = NULL, filename_dump = NULL, model = NULL, n_first_tree = NULL, CSSstyle = NULL, width = NULL, height = NULL){
if (!(class(CSSstyle) %in% c("character", "NULL") && length(CSSstyle) <= 1)) { if (class(model) != "xgb.Booster") {
stop("style: Has to be a character vector of size 1.")
}
if (!class(model) %in% c("xgb.Booster", "NULL")) {
stop("model: Has to be an object of class xgb.Booster model generaged by the xgb.train function.") stop("model: Has to be an object of class xgb.Booster model generaged by the xgb.train function.")
} }
@ -68,30 +45,40 @@ xgb.plot.tree <- function(feature_names = NULL, filename_dump = NULL, model = NU
stop("DiagrammeR package is required for xgb.plot.tree", call. = FALSE) stop("DiagrammeR package is required for xgb.plot.tree", call. = FALSE)
} }
if(is.null(model)){ allTrees <- xgb.model.dt.tree(feature_names = feature_names, model = model, n_first_tree = n_first_tree)
allTrees <- xgb.model.dt.tree(feature_names = feature_names, filename_dump = filename_dump, n_first_tree = n_first_tree)
} else {
allTrees <- xgb.model.dt.tree(feature_names = feature_names, model = model, n_first_tree = n_first_tree)
}
allTrees[Feature!="Leaf" ,yesPath:= paste(ID,"(", Feature, "<br/>Cover: ", Cover, "<br/>Gain: ", Quality, ")-->|< ", Split, "|", Yes, ">", Yes.Feature, "]", sep = "")] allTrees[, label:= paste0(Feature, "\nCover: ", Cover, "\nGain: ", Quality)]
allTrees[, shape:= "rectangle"][Feature == "Leaf", shape:= "oval"]
allTrees[, filledcolor:= "Beige"][Feature == "Leaf", filledcolor:= "Khaki"]
allTrees[Feature!="Leaf" ,noPath:= paste(ID,"(", Feature, ")-->|>= ", Split, "|", No, ">", No.Feature, "]", sep = "")] # rev is used to put the first tree on top.
nodes <- DiagrammeR::create_nodes(nodes = allTrees[,ID] %>% rev,
label = allTrees[,label] %>% rev,
style = "filled",
color = "DimGray",
fillcolor= allTrees[,filledcolor] %>% rev,
shape = allTrees[,shape] %>% rev,
data = allTrees[,Feature] %>% rev,
fontname = "Helvetica"
)
edges <- DiagrammeR::create_edges(from = allTrees[Feature != "Leaf", c(ID)] %>% rep(2),
to = allTrees[Feature != "Leaf", c(Yes, No)],
label = allTrees[Feature != "Leaf", paste("<",Split)] %>% c(rep("",nrow(allTrees[Feature != "Leaf"]))),
color = "DimGray",
arrowsize = "1.5",
arrowhead = "vee",
fontname = "Helvetica",
rel = "leading_to")
if(is.null(CSSstyle)){ graph <- DiagrammeR::create_graph(nodes_df = nodes,
CSSstyle <- "classDef greenNode fill:#A2EB86, stroke:#04C4AB, stroke-width:2px;classDef redNode fill:#FFA070, stroke:#FF5E5E, stroke-width:2px" edges_df = edges,
} graph_attrs = "rankdir = LR")
yes <- allTrees[Feature!="Leaf", c(Yes)] %>% paste(collapse = ",") %>% paste("class ", ., " greenNode", sep = "") DiagrammeR::render_graph(graph, width = plot.width, height = plot.height)
no <- allTrees[Feature!="Leaf", c(No)] %>% paste(collapse = ",") %>% paste("class ", ., " redNode", sep = "")
path <- allTrees[Feature!="Leaf", c(yesPath, noPath)] %>% .[order(.)] %>% paste(sep = "", collapse = ";") %>% paste("graph LR", .,collapse = "", sep = ";") %>% paste(CSSstyle, yes, no, sep = ";")
DiagrammeR::mermaid(path, width, height)
} }
# Avoid error messages during CRAN check. # Avoid error messages during CRAN check.
# The reason is that these variables are never declared # The reason is that these variables are never declared
# They are mainly column names inferred by Data.table... # They are mainly column names inferred by Data.table...
globalVariables(c("Feature", "yesPath", "ID", "Cover", "Quality", "Split", "Yes", "Yes.Feature", "noPath", "No", "No.Feature", ".")) globalVariables(c("Feature", "ID", "Cover", "Quality", "Split", "Yes", "No", ".", "shape", "filledcolor", "label"))
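A minimal sketch, assuming the `bst` model and agaricus data from the example above; n_first_tree restricts the rendering and plot.width / plot.height are forwarded to DiagrammeR:
xgb.plot.tree(feature_names = agaricus.train$data@Dimnames[[2]], model = bst,
              n_first_tree = 1, plot.width = 1000, plot.height = 600)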

@ -16,7 +16,6 @@
#' bst <- xgb.load('xgb.model') #' bst <- xgb.load('xgb.model')
#' pred <- predict(bst, test$data) #' pred <- predict(bst, test$data)
#' @export #' @export
#'
xgb.save <- function(model, fname) { xgb.save <- function(model, fname) {
if (typeof(fname) != "character") { if (typeof(fname) != "character") {
stop("xgb.save: fname must be character") stop("xgb.save: fname must be character")
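A minimal save / load round trip matching the example above (the file name is illustrative; `bst` is assumed to be a trained xgb.Booster):
data(agaricus.test, package = 'xgboost')
xgb.save(bst, "xgb.model")
bst2 <- xgb.load("xgb.model")
pred <- predict(bst2, agaricus.test$data)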

@ -16,7 +16,6 @@
#' bst <- xgb.load(raw) #' bst <- xgb.load(raw)
#' pred <- predict(bst, test$data) #' pred <- predict(bst, test$data)
#' @export #' @export
#'
xgb.save.raw <- function(model) { xgb.save.raw <- function(model) {
if (class(model) == "xgb.Booster"){ if (class(model) == "xgb.Booster"){
model <- model$handle model <- model$handle

@ -19,7 +19,7 @@
#' \item \code{eta} control the learning rate: scale the contribution of each tree by a factor of \code{0 < eta < 1} when it is added to the current approximation. Used to prevent overfitting by making the boosting process more conservative. Lower value for \code{eta} implies larger value for \code{nrounds}: low \code{eta} value means model more robust to overfitting but slower to compute. Default: 0.3 #' \item \code{eta} control the learning rate: scale the contribution of each tree by a factor of \code{0 < eta < 1} when it is added to the current approximation. Used to prevent overfitting by making the boosting process more conservative. Lower value for \code{eta} implies larger value for \code{nrounds}: low \code{eta} value means model more robust to overfitting but slower to compute. Default: 0.3
#' \item \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree. the larger, the more conservative the algorithm will be. #' \item \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree. the larger, the more conservative the algorithm will be.
#' \item \code{max_depth} maximum depth of a tree. Default: 6 #' \item \code{max_depth} maximum depth of a tree. Default: 6
#' \item \code{min_child_weight} minimum sum of instance weight(hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1 #' \item \code{min_child_weight} minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
#' \item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. It makes computation shorter (because less data to analyse). It is advised to use this parameter with \code{eta} and increase \code{nround}. Default: 1 #' \item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. It makes computation shorter (because less data to analyse). It is advised to use this parameter with \code{eta} and increase \code{nround}. Default: 1
#' \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1 #' \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
#' \item \code{num_parallel_tree} Experimental parameter. number of trees to grow per round. Useful to test Random Forest through Xgboost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{round = 1}) accordingly. Default: 1 #' \item \code{num_parallel_tree} Experimental parameter. number of trees to grow per round. Useful to test Random Forest through Xgboost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{round = 1}) accordingly. Default: 1
@ -43,7 +43,7 @@
#' \item \code{binary:logistic} logistic regression for binary classification. Output probability. #' \item \code{binary:logistic} logistic regression for binary classification. Output probability.
#' \item \code{binary:logitraw} logistic regression for binary classification, output score before logistic transformation. #' \item \code{binary:logitraw} logistic regression for binary classification, output score before logistic transformation.
#' \item \code{num_class} set the number of classes. To use only with multiclass objectives. #' \item \code{num_class} set the number of classes. To use only with multiclass objectives.
#' \item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective. Class is represented by a number and should be from 0 to \code{tonum_class}. #' \item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective. Class is represented by a number and should be from 0 to \code{num_class}.
#' \item \code{multi:softprob} same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata, nclass matrix. The result contains predicted probabilities of each data point belonging to each class. #' \item \code{multi:softprob} same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata, nclass matrix. The result contains predicted probabilities of each data point belonging to each class.
#' \item \code{rank:pairwise} set xgboost to do ranking task by minimizing the pairwise loss. #' \item \code{rank:pairwise} set xgboost to do ranking task by minimizing the pairwise loss.
#' } #' }
@ -89,6 +89,7 @@
#' \itemize{ #' \itemize{
#' \item \code{rmse} root mean square error. \url{http://en.wikipedia.org/wiki/Root_mean_square_error} #' \item \code{rmse} root mean square error. \url{http://en.wikipedia.org/wiki/Root_mean_square_error}
#' \item \code{logloss} negative log-likelihood. \url{http://en.wikipedia.org/wiki/Log-likelihood} #' \item \code{logloss} negative log-likelihood. \url{http://en.wikipedia.org/wiki/Log-likelihood}
#' \item \code{mlogloss} multiclass logloss. \url{https://www.kaggle.com/wiki/MultiClassLogLoss}
#' \item \code{error} Binary classification error rate. It is calculated as \code{(wrong cases) / (all cases)}. For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances. #' \item \code{error} Binary classification error rate. It is calculated as \code{(wrong cases) / (all cases)}. For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
#' \item \code{merror} Multiclass classification error rate. It is calculated as \code{(wrong cases) / (all cases)}. #' \item \code{merror} Multiclass classification error rate. It is calculated as \code{(wrong cases) / (all cases)}.
#' \item \code{auc} Area under the curve. \url{http://en.wikipedia.org/wiki/Receiver_operating_characteristic#'Area_under_curve} for ranking evaluation. #' \item \code{auc} Area under the curve. \url{http://en.wikipedia.org/wiki/Receiver_operating_characteristic#'Area_under_curve} for ranking evaluation.
@ -119,7 +120,6 @@
#' param <- list(max.depth = 2, eta = 1, silent = 1, objective=logregobj,eval_metric=evalerror) #' param <- list(max.depth = 2, eta = 1, silent = 1, objective=logregobj,eval_metric=evalerror)
#' bst <- xgb.train(param, dtrain, nthread = 2, nround = 2, watchlist) #' bst <- xgb.train(param, dtrain, nthread = 2, nround = 2, watchlist)
#' @export #' @export
#'
xgb.train <- function(params=list(), data, nrounds, watchlist = list(), xgb.train <- function(params=list(), data, nrounds, watchlist = list(),
obj = NULL, feval = NULL, verbose = 1, print.every.n=1L, obj = NULL, feval = NULL, verbose = 1, print.every.n=1L,
early.stop.round = NULL, maximize = NULL, early.stop.round = NULL, maximize = NULL,
@ -140,27 +140,28 @@ xgb.train <- function(params=list(), data, nrounds, watchlist = list(),
warning('watchlist is provided but verbose=0, no evaluation information will be printed') warning('watchlist is provided but verbose=0, no evaluation information will be printed')
} }
dot.params = list(...) fit.call <- match.call()
nms.params = names(params) dot.params <- list(...)
nms.dot.params = names(dot.params) nms.params <- names(params)
if (length(intersect(nms.params,nms.dot.params))>0) nms.dot.params <- names(dot.params)
if (length(intersect(nms.params,nms.dot.params)) > 0)
stop("Duplicated term in parameters. Please check your list of params.") stop("Duplicated term in parameters. Please check your list of params.")
params = append(params, dot.params) params <- append(params, dot.params)
# customized objective and evaluation metric interface # customized objective and evaluation metric interface
if (!is.null(params$objective) && !is.null(obj)) if (!is.null(params$objective) && !is.null(obj))
stop("xgb.train: cannot assign two different objectives") stop("xgb.train: cannot assign two different objectives")
if (!is.null(params$objective)) if (!is.null(params$objective))
if (class(params$objective)=='function') { if (class(params$objective) == 'function') {
obj = params$objective obj <- params$objective
params$objective = NULL params$objective <- NULL
} }
if (!is.null(params$eval_metric) && !is.null(feval)) if (!is.null(params$eval_metric) && !is.null(feval))
stop("xgb.train: cannot assign two different evaluation metrics") stop("xgb.train: cannot assign two different evaluation metrics")
if (!is.null(params$eval_metric)) if (!is.null(params$eval_metric))
if (class(params$eval_metric)=='function') { if (class(params$eval_metric) == 'function') {
feval = params$eval_metric feval <- params$eval_metric
params$eval_metric = NULL params$eval_metric <- NULL
} }
# Early stopping # Early stopping
@ -174,44 +175,43 @@ xgb.train <- function(params=list(), data, nrounds, watchlist = list(),
if (is.null(maximize)) if (is.null(maximize))
{ {
if (params$eval_metric %in% c('rmse','logloss','error','merror','mlogloss')) { if (params$eval_metric %in% c('rmse','logloss','error','merror','mlogloss')) {
maximize = FALSE maximize <- FALSE
} else { } else {
maximize = TRUE maximize <- TRUE
} }
} }
if (maximize) { if (maximize) {
bestScore = 0 bestScore <- 0
} else { } else {
bestScore = Inf bestScore <- Inf
} }
bestInd = 0 bestInd <- 0
earlyStopflag = FALSE earlyStopflag = FALSE
if (length(watchlist)>1) if (length(watchlist) > 1)
warning('Only the first data set in watchlist is used for early stopping process.') warning('Only the first data set in watchlist is used for early stopping process.')
} }
handle <- xgb.Booster(params, append(watchlist, dtrain)) handle <- xgb.Booster(params, append(watchlist, dtrain))
bst <- xgb.handleToBooster(handle) bst <- xgb.handleToBooster(handle)
print.every.n=max( as.integer(print.every.n), 1L) print.every.n <- max( as.integer(print.every.n), 1L)
for (i in 1:nrounds) { for (i in 1:nrounds) {
succ <- xgb.iter.update(bst$handle, dtrain, i - 1, obj) succ <- xgb.iter.update(bst$handle, dtrain, i - 1, obj)
if (length(watchlist) != 0) { if (length(watchlist) != 0) {
msg <- xgb.iter.eval(bst$handle, watchlist, i - 1, feval) msg <- xgb.iter.eval(bst$handle, watchlist, i - 1, feval)
if (0== ( (i-1) %% print.every.n)) if (0 == ( (i - 1) %% print.every.n))
cat(paste(msg, "\n", sep="")) cat(paste(msg, "\n", sep = ""))
if (!is.null(early.stop.round)) if (!is.null(early.stop.round))
{ {
score = strsplit(msg,':|\\s+')[[1]][3] score <- strsplit(msg,':|\\s+')[[1]][3]
score = as.numeric(score) score <- as.numeric(score)
if ((maximize && score>bestScore) || (!maximize && score<bestScore)) { if ( (maximize && score > bestScore) || (!maximize && score < bestScore)) {
bestScore = score bestScore <- score
bestInd = i bestInd <- i
} else { } else {
if (i-bestInd>=early.stop.round) { earlyStopflag = TRUE
earlyStopflag = TRUE if (i - bestInd >= early.stop.round) {
cat('Stopping. Best iteration:',bestInd) cat('Stopping. Best iteration:',bestInd)
break break
} }
@ -225,9 +225,13 @@ xgb.train <- function(params=list(), data, nrounds, watchlist = list(),
} }
} }
bst <- xgb.Booster.check(bst) bst <- xgb.Booster.check(bst)
if (!is.null(early.stop.round)) { if (!is.null(early.stop.round)) {
bst$bestScore = bestScore bst$bestScore <- bestScore
bst$bestInd = bestInd bst$bestInd <- bestInd
} }
attr(bst, "call") <- fit.call
attr(bst, "params") <- params
return(bst) return(bst)
} }
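A minimal sketch of the early stopping path above, assuming dtrain and dtest DMatrix objects as in the package demos; bestScore and bestInd are attached to the returned booster, and only the first data set in watchlist drives the stopping rule:
watchlist <- list(test = dtest, train = dtrain)
param <- list(max.depth = 2, eta = 0.3, silent = 1,
              objective = "binary:logistic", eval_metric = "error")
bst <- xgb.train(params = param, data = dtrain, nrounds = 50, watchlist = watchlist,
                 early.stop.round = 3, maximize = FALSE)
bst$bestScore
bst$bestInd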

@ -58,8 +58,7 @@
#' pred <- predict(bst, test$data) #' pred <- predict(bst, test$data)
#' #'
#' @export #' @export
#' xgboost <- function(data = NULL, label = NULL, missing = NA, weight = NULL,
xgboost <- function(data = NULL, label = NULL, missing = NULL, weight = NULL,
params = list(), nrounds, params = list(), nrounds,
verbose = 1, print.every.n = 1L, early.stop.round = NULL, verbose = 1, print.every.n = 1L, early.stop.round = NULL,
maximize = NULL, save_period = 0, save_name = "xgboost.model", ...) { maximize = NULL, save_period = 0, save_name = "xgboost.model", ...) {
@ -79,8 +78,6 @@ xgboost <- function(data = NULL, label = NULL, missing = NULL, weight = NULL,
return(bst) return(bst)
} }
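# A minimal sketch of the new missing = NA default above (not part of the original file;
# data and values are illustrative): a dense matrix with NA cells can now be passed directly,
# NA entries being treated as missing values.
dense <- as.matrix(agaricus.train$data[1:100, ])
dense[1, 1] <- NA
bst.dense <- xgboost(data = dense, label = agaricus.train$label[1:100],
                     max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic")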
#' Training part from Mushroom Data Set #' Training part from Mushroom Data Set
#' #'
#' This data set is originally from the Mushroom data set, #' This data set is originally from the Mushroom data set,

@ -14,28 +14,28 @@ class(train$data)
# this is the basic usage of xgboost you can put matrix in data field # this is the basic usage of xgboost you can put matrix in data field
# note: we are putting in sparse matrix here, xgboost naturally handles sparse input # note: we are putting in sparse matrix here, xgboost naturally handles sparse input
# use sparse matrix when your feature is sparse(e.g. when you are using one-hot encoding vector) # use sparse matrix when your feature is sparse(e.g. when you are using one-hot encoding vector)
print("training xgboost with sparseMatrix") print("Training xgboost with sparseMatrix")
bst <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nround = 2, bst <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nround = 2,
nthread = 2, objective = "binary:logistic") nthread = 2, objective = "binary:logistic")
# alternatively, you can put in dense matrix, i.e. basic R-matrix # alternatively, you can put in dense matrix, i.e. basic R-matrix
print("training xgboost with Matrix") print("Training xgboost with Matrix")
bst <- xgboost(data = as.matrix(train$data), label = train$label, max.depth = 2, eta = 1, nround = 2, bst <- xgboost(data = as.matrix(train$data), label = train$label, max.depth = 2, eta = 1, nround = 2,
nthread = 2, objective = "binary:logistic") nthread = 2, objective = "binary:logistic")
# you can also put in xgb.DMatrix object, which stores label, data and other meta datas needed for advanced features # you can also put in xgb.DMatrix object, which stores label, data and other meta datas needed for advanced features
print("training xgboost with xgb.DMatrix") print("Training xgboost with xgb.DMatrix")
dtrain <- xgb.DMatrix(data = train$data, label = train$label) dtrain <- xgb.DMatrix(data = train$data, label = train$label)
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, nthread = 2, bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, nthread = 2,
objective = "binary:logistic") objective = "binary:logistic")
# Verbose = 0,1,2 # Verbose = 0,1,2
print ('train xgboost with verbose 0, no message') print("Train xgboost with verbose 0, no message")
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2,
nthread = 2, objective = "binary:logistic", verbose = 0) nthread = 2, objective = "binary:logistic", verbose = 0)
print ('train xgboost with verbose 1, print evaluation metric') print("Train xgboost with verbose 1, print evaluation metric")
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2,
nthread = 2, objective = "binary:logistic", verbose = 1) nthread = 2, objective = "binary:logistic", verbose = 1)
print ('train xgboost with verbose 2, also print information about tree') print("Train xgboost with verbose 2, also print information about tree")
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2,
nthread = 2, objective = "binary:logistic", verbose = 2) nthread = 2, objective = "binary:logistic", verbose = 2)
@ -76,11 +76,11 @@ dtest <- xgb.DMatrix(data = test$data, label=test$label)
watchlist <- list(train=dtrain, test=dtest) watchlist <- list(train=dtrain, test=dtest)
# to train with watchlist, use xgb.train, which contains more advanced features # to train with watchlist, use xgb.train, which contains more advanced features
# watchlist allows us to monitor the evaluation result on all data in the list # watchlist allows us to monitor the evaluation result on all data in the list
print ('train xgboost using xgb.train with watchlist') print("Train xgboost using xgb.train with watchlist")
bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nround=2, watchlist=watchlist, bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nround=2, watchlist=watchlist,
nthread = 2, objective = "binary:logistic") nthread = 2, objective = "binary:logistic")
# we can change evaluation metrics, or use multiple evaluation metrics # we can change evaluation metrics, or use multiple evaluation metrics
print ('train xgboost using xgb.train with watchlist, watch logloss and error') print("train xgboost using xgb.train with watchlist, watch logloss and error")
bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nround=2, watchlist=watchlist, bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nround=2, watchlist=watchlist,
eval.metric = "error", eval.metric = "logloss", eval.metric = "error", eval.metric = "logloss",
nthread = 2, objective = "binary:logistic") nthread = 2, objective = "binary:logistic")
@ -102,4 +102,9 @@ xgb.dump(bst, "dump.raw.txt", with.stats = T)
# Finally, you can check which features are the most important. # Finally, you can check which features are the most important.
print("Most important features (look at column Gain):") print("Most important features (look at column Gain):")
print(xgb.importance(feature_names = train$data@Dimnames[[2]], filename_dump = "dump.raw.txt")) imp_matrix <- xgb.importance(feature_names = train$data@Dimnames[[2]], model = bst)
print(imp_matrix)
# Feature importance bar plot by gain
print("Feature importance Plot : ")
print(xgb.plot.importance(importance_matrix = imp_matrix))

@ -23,4 +23,4 @@ setinfo(dtrain, "base_margin", ptrain)
setinfo(dtest, "base_margin", ptest) setinfo(dtest, "base_margin", ptest)
print('this is result of boost from initial prediction') print('this is result of boost from initial prediction')
bst <- xgb.train( param, dtrain, 1, watchlist ) bst <- xgb.train(params = param, data = dtrain, nrounds = 1, watchlist = watchlist)
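For context, a hedged sketch of how the base margins used above are typically obtained (outputmargin = TRUE returns the untransformed scores of a first model; the one-round model below is illustrative):
bst0   <- xgb.train(params = param, data = dtrain, nrounds = 1, watchlist = watchlist)
ptrain <- predict(bst0, dtrain, outputmargin = TRUE)
ptest  <- predict(bst0, dtest, outputmargin = TRUE)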

@ -67,10 +67,9 @@ output_vector = df[,Y:=0][Improved == "Marked",Y:=1][,Y]
cat("Learning...\n") cat("Learning...\n")
bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 9, bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 9,
eta = 1, nthread = 2, nround = 10,objective = "binary:logistic") eta = 1, nthread = 2, nround = 10,objective = "binary:logistic")
xgb.dump(bst, 'xgb.model.dump', with.stats = T)
# sparse_matrix@Dimnames[[2]] represents the column names of the sparse matrix. # sparse_matrix@Dimnames[[2]] represents the column names of the sparse matrix.
importance <- xgb.importance(sparse_matrix@Dimnames[[2]], 'xgb.model.dump') importance <- xgb.importance(feature_names = sparse_matrix@Dimnames[[2]], model = bst)
print(importance) print(importance)
# According to the matrix below, the most important feature in this dataset to predict if the treatment will work is the Age. The second most important feature is having received a placebo or not. The sex is third. Then we see our generated features (AgeDiscret). We can see that their contribution is very low (Gain column). # According to the matrix below, the most important feature in this dataset to predict if the treatment will work is the Age. The second most important feature is having received a placebo or not. The sex is third. Then we see our generated features (AgeDiscret). We can see that their contribution is very low (Gain column).

@ -43,9 +43,9 @@ evalerror <- function(preds, dtrain) {
param <- list(max.depth=2,eta=1,silent=1, param <- list(max.depth=2,eta=1,silent=1,
objective = logregobj, eval_metric = evalerror) objective = logregobj, eval_metric = evalerror)
# train with customized objective # train with customized objective
xgb.cv(param, dtrain, nround, nfold = 5) xgb.cv(params = param, data = dtrain, nrounds = nround, nfold = 5)
# do cross validation with prediction values for each fold # do cross validation with prediction values for each fold
res <- xgb.cv(param, dtrain, nround, nfold=5, prediction = TRUE) res <- xgb.cv(params = param, data = dtrain, nrounds = nround, nfold = 5, prediction = TRUE)
res$dt res$dt
length(res$pred) length(res$pred)
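For context, a minimal sketch of the kind of customized objective and evaluation functions referenced by `param` above (logistic-loss gradient / hessian pair plus a simple error rate; the exact definitions in the demo may differ):
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds <- 1 / (1 + exp(-preds))   # raw margin -> probability
  grad <- preds - labels
  hess <- preds * (1 - preds)
  list(grad = grad, hess = hess)
}
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- as.numeric(sum(labels != (preds > 0))) / length(labels)
  list(metric = "error", value = err)
}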

@ -1,21 +1,52 @@
require(xgboost) require(xgboost)
require(data.table)
require(Matrix)
set.seed(1982)
# load in the agaricus dataset # load in the agaricus dataset
data(agaricus.train, package='xgboost') data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost') data(agaricus.test, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label) dtrain <- xgb.DMatrix(data = agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label) dtest <- xgb.DMatrix(data = agaricus.test$data, label = agaricus.test$label)
param <- list(max.depth=2,eta=1,silent=1,objective='binary:logistic') param <- list(max.depth=2, eta=1, silent=1, objective='binary:logistic')
watchlist <- list(eval = dtest, train = dtrain) nround = 4
nround = 5
# training the model for two rounds # training the model for two rounds
bst = xgb.train(param, dtrain, nround, nthread = 2, watchlist) bst = xgb.train(params = param, data = dtrain, nrounds = nround, nthread = 2)
cat('start testing prediction from first n trees\n')
# Model accuracy without new features
accuracy.before <- sum((predict(bst, agaricus.test$data) >= 0.5) == agaricus.test$label) / length(agaricus.test$label)
### predict using first 2 tree
pred_with_leaf = predict(bst, dtest, ntreelimit = 2, predleaf = TRUE)
head(pred_with_leaf)
# by default, we predict using all the trees # by default, we predict using all the trees
pred_with_leaf = predict(bst, dtest, predleaf = TRUE) pred_with_leaf = predict(bst, dtest, predleaf = TRUE)
head(pred_with_leaf) head(pred_with_leaf)
create.new.tree.features <- function(model, original.features){
pred_with_leaf <- predict(model, original.features, predleaf = TRUE)
cols <- list()
for(i in 1:ncol(pred_with_leaf)){
# max is not the real max but it's not important for the purpose of adding features
leaf.id <- sort(unique(pred_with_leaf[,i]))
cols[[i]] <- factor(x = pred_with_leaf[,i], levels = leaf.id)
}
cBind(original.features, sparse.model.matrix( ~ . -1, as.data.frame(cols)))
}
# Convert previous features to one hot encoding
new.features.train <- create.new.tree.features(bst, agaricus.train$data)
new.features.test <- create.new.tree.features(bst, agaricus.test$data)
# learning with new features
new.dtrain <- xgb.DMatrix(data = new.features.train, label = agaricus.train$label)
new.dtest <- xgb.DMatrix(data = new.features.test, label = agaricus.test$label)
watchlist <- list(train = new.dtrain)
bst <- xgb.train(params = param, data = new.dtrain, nrounds = nround, nthread = 2)
# Model accuracy with new features
accuracy.after <- sum((predict(bst, new.dtest) >= 0.5) == agaricus.test$label) / length(agaricus.test$label)
# Here the accuracy was already good and is now perfect.
cat(paste("The accuracy was", accuracy.before, "before adding leaf features and it is now", accuracy.after, "!\n"))

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand % Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgboost.R % Please edit documentation in R/xgboost.R
\docType{data} \docType{data}
\name{agaricus.test} \name{agaricus.test}

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand % Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgboost.R % Please edit documentation in R/xgboost.R
\docType{data} \docType{data}
\name{agaricus.train} \name{agaricus.train}

@ -0,0 +1,15 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.plot.deepness.R
\name{edge.parser}
\alias{edge.parser}
\title{Parse the graph to extract vector of edges}
\usage{
edge.parser(element)
}
\arguments{
\item{element}{igraph object containing the path from the root to the leaf.}
}
\description{
Parse the graph to extract vector of edges
}

@ -0,0 +1,15 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.plot.deepness.R
\name{get.paths.to.leaf}
\alias{get.paths.to.leaf}
\title{Extract path from root to leaf from data.table}
\usage{
get.paths.to.leaf(dt.tree)
}
\arguments{
\item{dt.tree}{data.table containing the nodes and edges of the trees}
}
\description{
Extract path from root to leaf from data.table
}

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand % Generated by roxygen2: do not edit by hand
% Please edit documentation in R/getinfo.xgb.DMatrix.R % Please edit documentation in R/getinfo.xgb.DMatrix.R
\docType{methods} \docType{methods}
\name{getinfo} \name{getinfo}

@ -0,0 +1,15 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.plot.deepness.R
\name{multiplot}
\alias{multiplot}
\title{Plot multiple graphs at the same time}
\usage{
multiplot(..., cols = 1)
}
\arguments{
\item{cols}{number of columns}
}
\description{
Plot multiple graph aligned by rows and columns.
}

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand % Generated by roxygen2: do not edit by hand
% Please edit documentation in R/nrow.xgb.DMatrix.R % Please edit documentation in R/nrow.xgb.DMatrix.R
\docType{methods} \docType{methods}
\name{nrow,xgb.DMatrix-method} \name{nrow,xgb.DMatrix-method}
@ -18,5 +18,6 @@ data(agaricus.train, package='xgboost')
train <- agaricus.train train <- agaricus.train
dtrain <- xgb.DMatrix(train$data, label=train$label) dtrain <- xgb.DMatrix(train$data, label=train$label)
stopifnot(nrow(dtrain) == nrow(train$data)) stopifnot(nrow(dtrain) == nrow(train$data))
} }

@ -1,11 +1,11 @@
% Generated by roxygen2 (4.1.1): do not edit by hand % Generated by roxygen2: do not edit by hand
% Please edit documentation in R/predict.xgb.Booster.R % Please edit documentation in R/predict.xgb.Booster.R
\docType{methods} \docType{methods}
\name{predict,xgb.Booster-method} \name{predict,xgb.Booster-method}
\alias{predict,xgb.Booster-method} \alias{predict,xgb.Booster-method}
\title{Predict method for eXtreme Gradient Boosting model} \title{Predict method for eXtreme Gradient Boosting model}
\usage{ \usage{
\S4method{predict}{xgb.Booster}(object, newdata, missing = NULL, \S4method{predict}{xgb.Booster}(object, newdata, missing = NA,
outputmargin = FALSE, ntreelimit = NULL, predleaf = FALSE) outputmargin = FALSE, ntreelimit = NULL, predleaf = FALSE)
} }
\arguments{ \arguments{
@ -31,6 +31,16 @@ than 0. It will use all trees by default.}
\description{ \description{
Predicted values based on xgboost model object. Predicted values based on xgboost model object.
} }
\details{
The purpose of the option \code{ntreelimit} is to let the user train a model with lots
of trees but use only the first trees for prediction, to avoid overfitting
(without having to train a new model with fewer trees).
The option \code{predleaf} is inspired by §3.1 of the paper
\code{Practical Lessons from Predicting Clicks on Ads at Facebook}.
The idea is to use the model as a generator of new features which capture non-linear
relationships present in the original features.
}
\examples{ \examples{
data(agaricus.train, package='xgboost') data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost') data(agaricus.test, package='xgboost')
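A minimal sketch of the two options described in the details section, assuming a booster `bst` trained on the agaricus data as in the surrounding examples:
pred.2trees <- predict(bst, agaricus.test$data, ntreelimit = 2)  # use only the first 2 trees
leaf.index  <- predict(bst, agaricus.test$data, predleaf = TRUE) # one column of leaf indices per tree
dim(leaf.index)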

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand % Generated by roxygen2: do not edit by hand
% Please edit documentation in R/predict.xgb.Booster.handle.R % Please edit documentation in R/predict.xgb.Booster.handle.R
\docType{methods} \docType{methods}
\name{predict,xgb.Booster.handle-method} \name{predict,xgb.Booster.handle-method}

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand % Generated by roxygen2: do not edit by hand
% Please edit documentation in R/setinfo.xgb.DMatrix.R % Please edit documentation in R/setinfo.xgb.DMatrix.R
\docType{methods} \docType{methods}
\name{setinfo} \name{setinfo}

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand % Generated by roxygen2: do not edit by hand
% Please edit documentation in R/slice.xgb.DMatrix.R % Please edit documentation in R/slice.xgb.DMatrix.R
\docType{methods} \docType{methods}
\name{slice} \name{slice}

@ -1,10 +1,10 @@
% Generated by roxygen2 (4.1.1): do not edit by hand % Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.DMatrix.R % Please edit documentation in R/xgb.DMatrix.R
\name{xgb.DMatrix} \name{xgb.DMatrix}
\alias{xgb.DMatrix} \alias{xgb.DMatrix}
\title{Contruct xgb.DMatrix object} \title{Contruct xgb.DMatrix object}
\usage{ \usage{
xgb.DMatrix(data, info = list(), missing = 0, ...) xgb.DMatrix(data, info = list(), missing = NA, ...)
} }
\arguments{ \arguments{
\item{data}{a \code{matrix} object, a \code{dgCMatrix} object or a character \item{data}{a \code{matrix} object, a \code{dgCMatrix} object or a character

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand % Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.DMatrix.save.R % Please edit documentation in R/xgb.DMatrix.save.R
\name{xgb.DMatrix.save} \name{xgb.DMatrix.save}
\alias{xgb.DMatrix.save} \alias{xgb.DMatrix.save}

@ -0,0 +1,88 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.create.features.R
\name{xgb.create.features}
\alias{xgb.create.features}
\title{Create new features from a previously learned model}
\usage{
xgb.create.features(model, training.data)
}
\arguments{
\item{model}{decision tree boosting model learned on the original data}
\item{training.data}{original data (usually provided as a \code{dgCMatrix} matrix)}
}
\value{
\code{dgCMatrix} matrix including both the original data and the new features.
}
\description{
May improve the learning by adding new features to the training data based on the decision trees from a previously learned model.
}
\details{
This function is inspired by paragraph 3.1 of the paper:
\strong{Practical Lessons from Predicting Clicks on Ads at Facebook}
\emph{(Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers,
Joaquin Quiñonero Candela)}
International Workshop on Data Mining for Online Advertising (ADKDD) - August 24, 2014
\url{https://research.facebook.com/publications/758569837499391/practical-lessons-from-predicting-clicks-on-ads-at-facebook/}.
Extract explaining the method:
"\emph{We found that boosted decision trees are a powerful and very
convenient way to implement non-linear and tuple transformations
of the kind we just described. We treat each individual
tree as a categorical feature that takes as value the
index of the leaf an instance ends up falling in. We use
1-of-K coding of this type of features.
For example, consider the boosted tree model in Figure 1 with 2 subtrees,
where the first subtree has 3 leafs and the second 2 leafs. If an
instance ends up in leaf 2 in the first subtree and leaf 1 in
second subtree, the overall input to the linear classifier will
be the binary vector \code{[0, 1, 0, 1, 0]}, where the first 3 entries
correspond to the leaves of the first subtree and last 2 to
those of the second subtree.
[...]
We can understand boosted decision tree
based transformation as a supervised feature encoding that
converts a real-valued vector into a compact binary-valued
vector. A traversal from root node to a leaf node represents
a rule on certain features.}"
}
\examples{
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
dtrain <- xgb.DMatrix(data = agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(data = agaricus.test$data, label = agaricus.test$label)
param <- list(max.depth=2, eta=1, silent=1, objective='binary:logistic')
nround <- 4
bst <- xgb.train(params = param, data = dtrain, nrounds = nround, nthread = 2)
# Model accuracy without new features
accuracy.before <- sum((predict(bst, agaricus.test$data) >= 0.5) == agaricus.test$label) / length(agaricus.test$label)
# Convert previous features to one hot encoding
new.features.train <- xgb.create.features(model = bst, agaricus.train$data)
new.features.test <- xgb.create.features(model = bst, agaricus.test$data)
# learning with new features
new.dtrain <- xgb.DMatrix(data = new.features.train, label = agaricus.train$label)
new.dtest <- xgb.DMatrix(data = new.features.test, label = agaricus.test$label)
watchlist <- list(train = new.dtrain)
bst <- xgb.train(params = param, data = new.dtrain, nrounds = nround, nthread = 2)
# Model accuracy with new features
accuracy.after <- sum((predict(bst, new.dtest) >= 0.5) == agaricus.test$label) / length(agaricus.test$label)
# Here the accuracy was already good and is now perfect.
cat(paste("The accuracy was", accuracy.before, "before adding leaf features and it is now", accuracy.after, "!\\n"))
}

View File

@ -1,14 +1,13 @@
% Generated by roxygen2 (4.1.1): do not edit by hand % Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.cv.R % Please edit documentation in R/xgb.cv.R
\name{xgb.cv} \name{xgb.cv}
\alias{xgb.cv} \alias{xgb.cv}
\title{Cross Validation} \title{Cross Validation}
\usage{ \usage{
xgb.cv(params = list(), data, nrounds, nfold, label = NULL, xgb.cv(params = list(), data, nrounds, nfold, label = NULL, missing = NA,
missing = NULL, prediction = FALSE, showsd = TRUE, metrics = list(), prediction = FALSE, showsd = TRUE, metrics = list(), obj = NULL,
obj = NULL, feval = NULL, stratified = TRUE, folds = NULL, feval = NULL, stratified = TRUE, folds = NULL, verbose = T,
verbose = T, print.every.n = 1L, early.stop.round = NULL, print.every.n = 1L, early.stop.round = NULL, maximize = NULL, ...)
maximize = NULL, ...)
} }
\arguments{ \arguments{
\item{params}{the list of parameters. Commonly used ones are: \item{params}{the list of parameters. Commonly used ones are:
@ -41,7 +40,7 @@ value that represents missing value. Sometime a data use 0 or other extreme valu
\item{showsd}{\code{boolean}, whether show standard deviation of cross validation} \item{showsd}{\code{boolean}, whether show standard deviation of cross validation}
\item{metrics,}{list of evaluation metrics to be used in cross validation, \item{metrics, }{list of evaluation metrics to be used in cross validation,
when it is not specified, the evaluation metric is chosen according to objective function. when it is not specified, the evaluation metric is chosen according to objective function.
Possible options are: Possible options are:
\itemize{ \itemize{
@ -73,7 +72,7 @@ If set to an integer \code{k}, training with a validation set will stop if the p
keeps getting worse consecutively for \code{k} rounds.} keeps getting worse consecutively for \code{k} rounds.}
\item{maximize}{If \code{feval} and \code{early.stop.round} are set, then \code{maximize} must be set as well. \item{maximize}{If \code{feval} and \code{early.stop.round} are set, then \code{maximize} must be set as well.
\code{maximize=TRUE} means the larger the evaluation score the better.} \code{maximize=TRUE} means the larger the evaluation score the better.}
\item{...}{other parameters to pass to \code{params}.} \item{...}{other parameters to pass to \code{params}.}
} }
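As a hedged illustration of the \code{early.stop.round} and \code{maximize} arguments described above, a cross validation run on the agaricus data (assumed here) could look like the sketch below; when training stops early, the returned table has fewer rows than \code{nrounds}:

```r
library(xgboost)
data(agaricus.train, package = 'xgboost')
# stop if the held-out error has not improved for 3 consecutive rounds
history <- xgb.cv(data = agaricus.train$data, label = agaricus.train$label,
                  nfold = 5, nrounds = 50, max.depth = 2, eta = 0.3, nthread = 2,
                  objective = "binary:logistic",
                  early.stop.round = 3, maximize = FALSE)
nrow(history)  # number of rounds actually performed
```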

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand % Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.dump.R % Please edit documentation in R/xgb.dump.R
\name{xgb.dump} \name{xgb.dump}
\alias{xgb.dump} \alias{xgb.dump}
@ -19,9 +19,9 @@ See demo/ for walkthrough example in R, and
for example Format.} for example Format.}
\item{with.stats}{whether dump statistics of splits \item{with.stats}{whether dump statistics of splits
When this option is on, the model dump comes with two additional statistics: When this option is on, the model dump comes with two additional statistics:
gain is the approximate loss function gain we get in each split; gain is the approximate loss function gain we get in each split;
cover is the sum of second order gradient in each node.} cover is the sum of second order gradient in each node.}
} }
\value{ \value{
if fname is not provided or set to \code{NULL} the function will return the model as a \code{character} vector. Otherwise it will return \code{TRUE}. if fname is not provided or set to \code{NULL} the function will return the model as a \code{character} vector. Otherwise it will return \code{TRUE}.

View File

@ -1,18 +1,16 @@
% Generated by roxygen2 (4.1.1): do not edit by hand % Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.importance.R % Please edit documentation in R/xgb.importance.R
\name{xgb.importance} \name{xgb.importance}
\alias{xgb.importance} \alias{xgb.importance}
\title{Show importance of features in a model} \title{Show importance of features in a model}
\usage{ \usage{
xgb.importance(feature_names = NULL, filename_dump = NULL, model = NULL, xgb.importance(feature_names = NULL, model = NULL, data = NULL,
data = NULL, label = NULL, target = function(x) ((x + label) == 2)) label = NULL, target = function(x) ((x + label) == 2))
} }
\arguments{ \arguments{
\item{feature_names}{names of each feature as a character vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.} \item{feature_names}{names of each feature as a \code{character} vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.}
\item{filename_dump}{the path to the text file storing the model. Model dump must include the gain per feature and per tree (\code{with.stats = T} in function \code{xgb.dump}).} \item{model}{generated by the \code{xgb.train} function.}
\item{model}{generated by the \code{xgb.train} function. Avoid the creation of a dump file.}
\item{data}{the dataset used for the training step. Will be used with \code{label} parameter for co-occurence computation. More information in \code{Detail} part. This parameter is optional.} \item{data}{the dataset used for the training step. Will be used with \code{label} parameter for co-occurence computation. More information in \code{Detail} part. This parameter is optional.}
@ -24,23 +22,24 @@ xgb.importance(feature_names = NULL, filename_dump = NULL, model = NULL,
A \code{data.table} of the features used in the model with their average gain (and their weight for boosted tree model) in the model. A \code{data.table} of the features used in the model with their average gain (and their weight for boosted tree model) in the model.
} }
\description{ \description{
Read a xgboost model text dump. Create a \code{data.table} of the most important features of a model.
Can be tree or linear model (text dump of linear model are only supported in dev version of \code{Xgboost} for now).
} }
\details{ \details{
This is the function to understand the model trained (and through your model, your data). This function is for both linear and tree models.
Results are returned for both linear and tree models.
\code{data.table} is returned by the function. \code{data.table} is returned by the function.
There are 3 columns: The columns are:
\itemize{ \itemize{
\item \code{Features} name of the features as provided in \code{feature_names} or already present in the model dump. \item \code{Features} name of the features as provided in \code{feature_names} or already present in the model dump;
\item \code{Gain} contribution of each feature to the model. For boosted tree model, each gain of each feature of each tree is taken into account, then averaged per feature to give a vision of the entire model. A higher percentage means a more important feature for predicting the \code{label} used for the training ; \item \code{Gain} contribution of each feature to the model. For boosted tree model, each gain of each feature of each tree is taken into account, then averaged per feature to give a vision of the entire model. A higher percentage means a more important feature for predicting the \code{label} used for the training (only available for tree models);
\item \code{Cover} metric of the number of observation related to this feature (only available for tree models) ; \item \code{Cover} metric of the number of observation related to this feature (only available for tree models);
\item \code{Weight} percentage representing the relative number of times a feature has been taken into trees. \code{Gain} should be preferred to search the most important feature. For boosted linear model, this column has no meaning. \item \code{Weight} percentage representing the relative number of times a feature has been taken into trees.
} }
If you don't provide \code{feature_names}, the index of the features will be used instead.
Because the index is extracted from the model dump (made on the C++ side), it starts at 0 (usual in C++) instead of 1 (usual in R).
Co-occurence count Co-occurence count
------------------ ------------------
@ -53,18 +52,14 @@ If you need to remember one thing only: until you want to leave us early, don't
\examples{ \examples{
data(agaricus.train, package='xgboost') data(agaricus.train, package='xgboost')
# Both dataset are list with two items, a sparse matrix and labels bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 2,
# (labels = outcome column which will be learned).
# Each column of the sparse Matrix is a feature in one hot encoding format.
train <- agaricus.train
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
eta = 1, nthread = 2, nround = 2,objective = "binary:logistic") eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
# train$data@Dimnames[[2]] represents the column names of the sparse matrix. # agaricus.train$data@Dimnames[[2]] represents the column names of the sparse matrix.
xgb.importance(train$data@Dimnames[[2]], model = bst) xgb.importance(agaricus.train$data@Dimnames[[2]], model = bst)
# Same thing with co-occurence computation this time # Same thing with co-occurence computation this time
xgb.importance(train$data@Dimnames[[2]], model = bst, data = train$data, label = train$label) xgb.importance(agaricus.train$data@Dimnames[[2]], model = bst, data = agaricus.train$data, label = agaricus.train$label)
} }

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand % Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.load.R % Please edit documentation in R/xgb.load.R
\name{xgb.load} \name{xgb.load}
\alias{xgb.load} \alias{xgb.load}

View File

@ -1,33 +1,33 @@
% Generated by roxygen2 (4.1.1): do not edit by hand % Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.model.dt.tree.R % Please edit documentation in R/xgb.model.dt.tree.R
\name{xgb.model.dt.tree} \name{xgb.model.dt.tree}
\alias{xgb.model.dt.tree} \alias{xgb.model.dt.tree}
\title{Convert tree model dump to data.table} \title{Parse boosted tree model text dump}
\usage{ \usage{
xgb.model.dt.tree(feature_names = NULL, filename_dump = NULL, xgb.model.dt.tree(feature_names = NULL, model = NULL, text = NULL,
model = NULL, text = NULL, n_first_tree = NULL) n_first_tree = NULL)
} }
\arguments{ \arguments{
\item{feature_names}{names of each feature as a character vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.} \item{feature_names}{names of each feature as a character vector. Can be extracted from a sparse matrix (see example). If the model already contains feature names, this argument should be \code{NULL} (default value).}
\item{filename_dump}{the path to the text file storing the model. Model dump must include the gain per feature and per tree (parameter \code{with.stats = T} in function \code{xgb.dump}).} \item{model}{object created by the \code{xgb.train} function.}
\item{model}{dump generated by the \code{xgb.train} function. Avoid the creation of a dump file.} \item{text}{\code{character} vector generated by the \code{xgb.dump} function. Model dump must include the gain per feature and per tree (parameter \code{with.stats = TRUE} in function \code{xgb.dump}).}
\item{text}{dump generated by the \code{xgb.dump} function. Avoid the creation of a dump file. Model dump must include the gain per feature and per tree (parameter \code{with.stats = T} in function \code{xgb.dump}).} \item{n_first_tree}{limit the plot to the \code{n} first trees. If set to \code{NULL}, all trees of the model are plotted. Performance can be low depending of the size of the model.}
\item{n_first_tree}{limit the plot to the n first trees. If \code{NULL}, all trees of the model are plotted. Performance can be low for huge models.}
} }
\value{ \value{
A \code{data.table} of the features used in the model with their gain, cover and a few other things. A \code{data.table} of the features used in the model with their gain, cover and a few other details.
} }
\description{ \description{
Read a tree model text dump and return a data.table. Parse a boosted tree model text dump and return a \code{data.table}.
} }
\details{ \details{
General function to convert a text dump of tree model to a Matrix. The purpose is to help the user explore the model and get a better understanding of it. General function to convert a text dump of tree model to a \code{data.table}.
The content of the \code{data.table} is organised as follows: The purpose is to help the user explore the model and get a better understanding of it.
The columns of the \code{data.table} are:
\itemize{ \itemize{
\item \code{ID}: unique identifier of a node ; \item \code{ID}: unique identifier of a node ;
@ -39,21 +39,17 @@ The content of the \code{data.table} is organised that way:
\item \code{Quality}: it's the gain related to the split in this specific node ; \item \code{Quality}: it's the gain related to the split in this specific node ;
\item \code{Cover}: metric to measure the number of observation affected by the split ; \item \code{Cover}: metric to measure the number of observation affected by the split ;
\item \code{Tree}: ID of the tree. It is included in the main ID ; \item \code{Tree}: ID of the tree. It is included in the main ID ;
\item \code{Yes.X} or \code{No.X}: data related to the pointer in \code{Yes} or \code{No} column ; \item \code{Yes.Feature}, \code{No.Feature}, \code{Yes.Cover}, \code{No.Cover}, \code{Yes.Quality} and \code{No.Quality}: data related to the pointer in \code{Yes} or \code{No} column ;
} }
} }
\examples{ \examples{
data(agaricus.train, package='xgboost') data(agaricus.train, package='xgboost')
#Both dataset are list with two items, a sparse matrix and labels bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 2,
#(labels = outcome column which will be learned).
#Each column of the sparse Matrix is a feature in one hot encoding format.
train <- agaricus.train
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
eta = 1, nthread = 2, nround = 2,objective = "binary:logistic") eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
#agaricus.test$data@Dimnames[[2]] represents the column names of the sparse matrix. # agaricus.train$data@Dimnames[[2]] represents the column names of the sparse matrix.
xgb.model.dt.tree(agaricus.train$data@Dimnames[[2]], model = bst) xgb.model.dt.tree(feature_names = agaricus.train$data@Dimnames[[2]], model = bst)
} }

View File

@ -0,0 +1,46 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.plot.deepness.R
\name{xgb.plot.deepness}
\alias{xgb.plot.deepness}
\title{Plot model trees deepness}
\usage{
xgb.plot.deepness(model = NULL)
}
\arguments{
\item{model}{model generated by the \code{xgb.train} function.}
}
\value{
Two graphs showing the distribution of the model deepness.
}
\description{
Generate a graph to plot the distribution of deepness among trees.
}
\details{
Display both the number of leaves and the distribution of weighted observations
by tree deepness level.
The purpose of this function is to help the user set
the \code{max.depth} and \code{min_child_weight} parameters according to the bias / variance trade-off.
See \link{xgb.train} for more information about these parameters.
The graph is made of two parts:
\itemize{
\item Count: number of leaves per level of deepness;
\item Weighted cover: normalized weighted cover per leaf (weighted number of instances).
}
This function is inspired by the blog post \url{http://aysent.github.io/2015/11/08/random-forest-leaf-visualization.html}
}
\examples{
data(agaricus.train, package='xgboost')
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 15,
eta = 1, nthread = 2, nround = 30, objective = "binary:logistic",
min_child_weight = 50)
xgb.plot.deepness(model = bst)
}

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand % Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.plot.importance.R % Please edit documentation in R/xgb.plot.importance.R
\name{xgb.plot.importance} \name{xgb.plot.importance}
\alias{xgb.plot.importance} \alias{xgb.plot.importance}
@ -15,11 +15,11 @@ xgb.plot.importance(importance_matrix = NULL, numberOfClusters = c(1:10))
A \code{ggplot2} bar graph representing each feature by a horizontal bar. The longer the bar, the more important the feature. Features are sorted and clustered by importance. The group is represented through the color of the bar. A \code{ggplot2} bar graph representing each feature by a horizontal bar. The longer the bar, the more important the feature. Features are sorted and clustered by importance. The group is represented through the color of the bar.
} }
\description{ \description{
Read a data.table containing feature importance details and plot it. Read a data.table containing feature importance details and plot it (for both GLM and Trees).
} }
\details{ \details{
The purpose of this function is to easily represent the importance of each feature of a model. The purpose of this function is to easily represent the importance of each feature of a model.
The function return a ggplot graph, therefore each of its characteristics can be overridden (to customize it). The function returns a ggplot graph, therefore each of its characteristics can be overridden (to customize it).
In particular you may want to override the title of the graph. To do so, add \code{+ ggtitle("A GRAPH NAME")} next to the value returned by this function. In particular you may want to override the title of the graph. To do so, add \code{+ ggtitle("A GRAPH NAME")} next to the value returned by this function.
} }
\examples{ \examples{
@ -28,13 +28,13 @@ data(agaricus.train, package='xgboost')
#Both dataset are list with two items, a sparse matrix and labels #Both dataset are list with two items, a sparse matrix and labels
#(labels = outcome column which will be learned). #(labels = outcome column which will be learned).
#Each column of the sparse Matrix is a feature in one hot encoding format. #Each column of the sparse Matrix is a feature in one hot encoding format.
train <- agaricus.train
bst <- xgboost(data = train$data, label = train$label, max.depth = 2, bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 2,
eta = 1, nthread = 2, nround = 2,objective = "binary:logistic") eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
#train$data@Dimnames[[2]] represents the column names of the sparse matrix. #agaricus.train$data@Dimnames[[2]] represents the column names of the sparse matrix.
importance_matrix <- xgb.importance(train$data@Dimnames[[2]], model = bst) importance_matrix <- xgb.importance(agaricus.train$data@Dimnames[[2]], model = bst)
xgb.plot.importance(importance_matrix) xgb.plot.importance(importance_matrix)
} }
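Following the details above, the returned \code{ggplot} object can be customized further. A small sketch, assuming the \code{importance_matrix} from the example and that \code{ggplot2} is installed:

```r
library(ggplot2)
# override the default title of the importance plot
xgb.plot.importance(importance_matrix) + ggtitle("Feature importance, mushroom data")
```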

View File

@ -0,0 +1,58 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.plot.multi.trees.R
\name{xgb.plot.multi.trees}
\alias{xgb.plot.multi.trees}
\title{Project all trees on one tree and plot it}
\usage{
xgb.plot.multi.trees(model, feature_names = NULL, features.keep = 5,
plot.width = NULL, plot.height = NULL)
}
\arguments{
\item{model}{model generated by the \code{xgb.train} function.}
\item{feature_names}{names of each feature as a \code{character} vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.}
\item{features.keep}{number of features to keep in each position of the multi trees.}
\item{plot.width}{width in pixels of the graph to produce}
\item{plot.height}{height in pixels of the graph to produce}
}
\value{
A single plot projecting all the trees of the model onto one tree.
}
\description{
Visualization of the ensemble of trees as a single collective unit.
}
\details{
This function tries to capture the complexity of a gradient boosted tree ensemble
in a cohesive way.
The goal is to improve the interpretability of a model generally seen as a black box.
The function is dedicated to boosting applied to decision trees only.
The purpose is to move from an ensemble of trees to a single tree only.
It takes advantage of the fact that the shape of a binary tree is only defined by
its deepness (therefore in a boosting model, all trees have the same shape).
Moreover, the trees tend to reuse the same features.
The function projects each tree onto a single common tree and keeps, for each position, the
\code{features.keep} features with the highest Gain.
This function is inspired by this blog post:
\url{https://wellecks.wordpress.com/2015/02/21/peering-into-the-black-box-visualizing-lambdamart/}
}
\examples{
data(agaricus.train, package='xgboost')
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 15,
eta = 1, nthread = 2, nround = 30, objective = "binary:logistic",
min_child_weight = 50)
p <- xgb.plot.multi.trees(model = bst, feature_names = agaricus.train$data@Dimnames[[2]], features.keep = 3)
print(p)
}

View File

@ -1,58 +1,48 @@
% Generated by roxygen2 (4.1.1): do not edit by hand % Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.plot.tree.R % Please edit documentation in R/xgb.plot.tree.R
\name{xgb.plot.tree} \name{xgb.plot.tree}
\alias{xgb.plot.tree} \alias{xgb.plot.tree}
\title{Plot a boosted tree model} \title{Plot a boosted tree model}
\usage{ \usage{
xgb.plot.tree(feature_names = NULL, filename_dump = NULL, model = NULL, xgb.plot.tree(feature_names = NULL, model = NULL, n_first_tree = NULL,
n_first_tree = NULL, CSSstyle = NULL, width = NULL, height = NULL) plot.width = NULL, plot.height = NULL)
} }
\arguments{ \arguments{
\item{feature_names}{names of each feature as a character vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.} \item{feature_names}{names of each feature as a \code{character} vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.}
\item{filename_dump}{the path to the text file storing the model. Model dump must include the gain per feature and per tree (parameter \code{with.stats = T} in function \code{xgb.dump}). Possible to provide a model directly (see \code{model} argument).}
\item{model}{generated by the \code{xgb.train} function. Avoid the creation of a dump file.} \item{model}{generated by the \code{xgb.train} function. Avoid the creation of a dump file.}
\item{n_first_tree}{limit the plot to the n first trees. If \code{NULL}, all trees of the model are plotted. Performance can be low for huge models.} \item{n_first_tree}{limit the plot to the n first trees. If \code{NULL}, all trees of the model are plotted. Performance can be low for huge models.}
\item{CSSstyle}{a \code{character} vector storing a css style to customize the appearance of nodes. Look at the \href{https://github.com/knsv/mermaid/wiki}{Mermaid wiki} for more information.} \item{plot.width}{the width of the diagram in pixels.}
\item{width}{the width of the diagram in pixels.} \item{plot.height}{the height of the diagram in pixels.}
\item{height}{the height of the diagram in pixels.}
} }
\value{ \value{
A \code{DiagrammeR} of the model. A \code{DiagrammeR} of the model.
} }
\description{ \description{
Read a tree model text dump. Read a tree model text dump and plot the model.
Plotting only works for boosted tree model (not linear model).
} }
\details{ \details{
The content of each node is organised as follows: The content of each node is organised as follows:
\itemize{ \itemize{
\item \code{feature} value ; \item \code{feature} value;
\item \code{cover}: the sum of second order gradient of training data classified to the leaf; if it is square loss, this simply corresponds to the number of instances in that branch. The deeper in the tree a node is, the lower this metric will be ; \item \code{cover}: the sum of second order gradient of training data classified to the leaf; if it is square loss, this simply corresponds to the number of instances in that branch. The deeper in the tree a node is, the lower this metric will be;
\item \code{gain}: metric measuring the importance of the node in the model. \item \code{gain}: metric measuring the importance of the node in the model.
} }
Each branch finishes with a leaf. For each leaf, only the \code{cover} is indicated. The function uses the \href{http://www.graphviz.org/}{GraphViz} library for that purpose.
It uses \href{https://github.com/knsv/mermaid/}{Mermaid} library for that purpose.
} }
\examples{ \examples{
data(agaricus.train, package='xgboost') data(agaricus.train, package='xgboost')
#Both dataset are list with two items, a sparse matrix and labels bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 2,
#(labels = outcome column which will be learned).
#Each column of the sparse Matrix is a feature in one hot encoding format.
train <- agaricus.train
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
eta = 1, nthread = 2, nround = 2,objective = "binary:logistic") eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
#agaricus.test$data@Dimnames[[2]] represents the column names of the sparse matrix. # agaricus.train$data@Dimnames[[2]] represents the column names of the sparse matrix.
xgb.plot.tree(agaricus.train$data@Dimnames[[2]], model = bst) xgb.plot.tree(feature_names = agaricus.train$data@Dimnames[[2]], model = bst)
} }
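A hedged sketch of sizing the diagram through the \code{plot.width} and \code{plot.height} arguments from the usage above, reusing the \code{bst} model of the example (the pixel values are arbitrary):

```r
xgb.plot.tree(feature_names = agaricus.train$data@Dimnames[[2]], model = bst,
              plot.width = 600, plot.height = 400)
```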

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand % Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.save.R % Please edit documentation in R/xgb.save.R
\name{xgb.save} \name{xgb.save}
\alias{xgb.save} \alias{xgb.save}

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand % Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.save.raw.R % Please edit documentation in R/xgb.save.raw.R
\name{xgb.save.raw} \name{xgb.save.raw}
\alias{xgb.save.raw} \alias{xgb.save.raw}

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand % Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.train.R % Please edit documentation in R/xgb.train.R
\name{xgb.train} \name{xgb.train}
\alias{xgb.train} \alias{xgb.train}
@ -27,7 +27,7 @@ xgb.train(params = list(), data, nrounds, watchlist = list(), obj = NULL,
\item \code{eta} control the learning rate: scale the contribution of each tree by a factor of \code{0 < eta < 1} when it is added to the current approximation. Used to prevent overfitting by making the boosting process more conservative. Lower value for \code{eta} implies larger value for \code{nrounds}: low \code{eta} value means model more robust to overfitting but slower to compute. Default: 0.3 \item \code{eta} control the learning rate: scale the contribution of each tree by a factor of \code{0 < eta < 1} when it is added to the current approximation. Used to prevent overfitting by making the boosting process more conservative. Lower value for \code{eta} implies larger value for \code{nrounds}: low \code{eta} value means model more robust to overfitting but slower to compute. Default: 0.3
\item \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree. the larger, the more conservative the algorithm will be. \item \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree. the larger, the more conservative the algorithm will be.
\item \code{max_depth} maximum depth of a tree. Default: 6 \item \code{max_depth} maximum depth of a tree. Default: 6
\item \code{min_child_weight} minimum sum of instance weight(hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1 \item \code{min_child_weight} minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
\item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. It makes computation shorter (because less data to analyse). It is advised to use this parameter with \code{eta} and increase \code{nround}. Default: 1 \item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. It makes computation shorter (because less data to analyse). It is advised to use this parameter with \code{eta} and increase \code{nround}. Default: 1
\item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1 \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
\item \code{num_parallel_tree} Experimental parameter. number of trees to grow per round. Useful to test Random Forest through Xgboost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{round = 1}) accordingly. Default: 1 \item \code{num_parallel_tree} Experimental parameter. number of trees to grow per round. Useful to test Random Forest through Xgboost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{round = 1}) accordingly. Default: 1
@ -51,7 +51,7 @@ xgb.train(params = list(), data, nrounds, watchlist = list(), obj = NULL,
\item \code{binary:logistic} logistic regression for binary classification. Output probability. \item \code{binary:logistic} logistic regression for binary classification. Output probability.
\item \code{binary:logitraw} logistic regression for binary classification, output score before logistic transformation. \item \code{binary:logitraw} logistic regression for binary classification, output score before logistic transformation.
\item \code{num_class} set the number of classes. To use only with multiclass objectives. \item \code{num_class} set the number of classes. To use only with multiclass objectives.
\item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective. Class is represented by a number and should be from 0 to \code{tonum_class}. \item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective. Class is represented by a number and should be from 0 to \code{num_class}.
\item \code{multi:softprob} same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata, nclass matrix. The result contains predicted probabilities of each data point belonging to each class. \item \code{multi:softprob} same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata, nclass matrix. The result contains predicted probabilities of each data point belonging to each class.
\item \code{rank:pairwise} set xgboost to do ranking task by minimizing the pairwise loss. \item \code{rank:pairwise} set xgboost to do ranking task by minimizing the pairwise loss.
} }
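To make the multiclass objectives listed above concrete, here is a sketch (not from the original documentation) using the \code{xgboost()} convenience wrapper on the iris dataset, assumed here as a stand-in for real data; class labels must be coded from 0 to \code{num_class - 1}:

```r
library(xgboost)
data(iris)
x <- as.matrix(iris[, -5])
y <- as.integer(iris$Species) - 1   # classes coded 0, 1, 2
bst <- xgboost(data = x, label = y, max.depth = 3, eta = 0.3, nthread = 2,
               nround = 10, objective = "multi:softmax", num_class = 3)
pred <- predict(bst, x)             # one predicted class index per row
# with objective = "multi:softprob", predict() returns a vector of
# length nrow(x) * num_class holding per-class probabilities
```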
@ -64,10 +64,10 @@ xgb.train(params = list(), data, nrounds, watchlist = list(), obj = NULL,
\item{nrounds}{the max number of iterations} \item{nrounds}{the max number of iterations}
\item{watchlist}{what information should be printed when \code{verbose=1} or \item{watchlist}{what information should be printed when \code{verbose=1} or
\code{verbose=2}. Watchlist is used to specify validation set monitoring \code{verbose=2}. Watchlist is used to specify validation set monitoring
during training. For example user can specify during training. For example user can specify
watchlist=list(validation1=mat1, validation2=mat2) to watch watchlist=list(validation1=mat1, validation2=mat2) to watch
the performance of each round's model on mat1 and mat2} the performance of each round's model on mat1 and mat2}
\item{obj}{customized objective function. Returns gradient and second order \item{obj}{customized objective function. Returns gradient and second order
gradient with given prediction and dtrain,} gradient with given prediction and dtrain,}
@ -110,6 +110,7 @@ Number of threads can also be manually specified via \code{nthread} parameter.
\itemize{ \itemize{
\item \code{rmse} root mean square error. \url{http://en.wikipedia.org/wiki/Root_mean_square_error} \item \code{rmse} root mean square error. \url{http://en.wikipedia.org/wiki/Root_mean_square_error}
\item \code{logloss} negative log-likelihood. \url{http://en.wikipedia.org/wiki/Log-likelihood} \item \code{logloss} negative log-likelihood. \url{http://en.wikipedia.org/wiki/Log-likelihood}
\item \code{mlogloss} multiclass logloss. \url{https://www.kaggle.com/wiki/MultiClassLogLoss}
\item \code{error} Binary classification error rate. It is calculated as \code{(wrong cases) / (all cases)}. For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances. \item \code{error} Binary classification error rate. It is calculated as \code{(wrong cases) / (all cases)}. For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
\item \code{merror} Multiclass classification error rate. It is calculated as \code{(wrong cases) / (all cases)}. \item \code{merror} Multiclass classification error rate. It is calculated as \code{(wrong cases) / (all cases)}.
\item \code{auc} Area under the curve. \url{http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_curve} for ranking evaluation. \item \code{auc} Area under the curve. \url{http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_curve} for ranking evaluation.

View File

@ -1,10 +1,10 @@
% Generated by roxygen2 (4.1.1): do not edit by hand % Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgboost.R % Please edit documentation in R/xgboost.R
\name{xgboost} \name{xgboost}
\alias{xgboost} \alias{xgboost}
\title{eXtreme Gradient Boosting (Tree) library} \title{eXtreme Gradient Boosting (Tree) library}
\usage{ \usage{
xgboost(data = NULL, label = NULL, missing = NULL, weight = NULL, xgboost(data = NULL, label = NULL, missing = NA, weight = NULL,
params = list(), nrounds, verbose = 1, print.every.n = 1L, params = list(), nrounds, verbose = 1, print.every.n = 1L,
early.stop.round = NULL, maximize = NULL, save_period = 0, early.stop.round = NULL, maximize = NULL, save_period = 0,
save_name = "xgboost.model", ...) save_name = "xgboost.model", ...)
@ -78,5 +78,6 @@ test <- agaricus.test
bst <- xgboost(data = train$data, label = train$label, max.depth = 2, bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
eta = 1, nthread = 2, nround = 2, objective = "binary:logistic") eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
pred <- predict(bst, test$data) pred <- predict(bst, test$data)
} }

View File

@ -4,30 +4,33 @@ context("basic functions")
data(agaricus.train, package='xgboost') data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost') data(agaricus.test, package='xgboost')
train = agaricus.train train <- agaricus.train
test = agaricus.test test <- agaricus.test
set.seed(1994)
test_that("train and predict", { test_that("train and predict", {
bst = xgboost(data = train$data, label = train$label, max.depth = 2, bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
eta = 1, nthread = 2, nround = 2, objective = "binary:logistic") eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
pred = predict(bst, test$data) pred <- predict(bst, test$data)
expect_equal(length(pred), 1611)
}) })
test_that("early stopping", { test_that("early stopping", {
res = xgb.cv(data = train$data, label = train$label, max.depth = 2, nfold = 5, res <- xgb.cv(data = train$data, label = train$label, max.depth = 2, nfold = 5,
eta = 0.3, nthread = 2, nround = 20, objective = "binary:logistic", eta = 0.3, nthread = 2, nround = 20, objective = "binary:logistic",
early.stop.round = 3, maximize = FALSE) early.stop.round = 3, maximize = FALSE)
expect_true(nrow(res)<20) expect_true(nrow(res) < 20)
bst = xgboost(data = train$data, label = train$label, max.depth = 2, bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
eta = 0.3, nthread = 2, nround = 20, objective = "binary:logistic", eta = 0.3, nthread = 2, nround = 20, objective = "binary:logistic",
early.stop.round = 3, maximize = FALSE) early.stop.round = 3, maximize = FALSE)
pred = predict(bst, test$data) pred <- predict(bst, test$data)
expect_equal(length(pred), 1611)
}) })
test_that("save_period", { test_that("save_period", {
bst = xgboost(data = train$data, label = train$label, max.depth = 2, bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
eta = 0.3, nthread = 2, nround = 20, objective = "binary:logistic", eta = 0.3, nthread = 2, nround = 20, objective = "binary:logistic",
save_period = 10, save_name = "xgb.model") save_period = 10, save_name = "xgb.model")
pred = predict(bst, test$data) pred <- predict(bst, test$data)
expect_equal(length(pred), 1611)
}) })

View File

@ -2,25 +2,26 @@ context('Test models with custom objective')
require(xgboost) require(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
test_that("custom objective works", { test_that("custom objective works", {
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
watchlist <- list(eval = dtest, train = dtrain) watchlist <- list(eval = dtest, train = dtrain)
num_round <- 2 num_round <- 2
logregobj <- function(preds, dtrain) { logregobj <- function(preds, dtrain) {
labels <- getinfo(dtrain, "label") labels <- getinfo(dtrain, "label")
preds <- 1/(1 + exp(-preds)) preds <- 1 / (1 + exp(-preds))
grad <- preds - labels grad <- preds - labels
hess <- preds * (1 - preds) hess <- preds * (1 - preds)
return(list(grad = grad, hess = hess)) return(list(grad = grad, hess = hess))
} }
evalerror <- function(preds, dtrain) { evalerror <- function(preds, dtrain) {
labels <- getinfo(dtrain, "label") labels <- getinfo(dtrain, "label")
err <- as.numeric(sum(labels != (preds > 0)))/length(labels) err <- as.numeric(sum(labels != (preds > 0))) / length(labels)
return(list(metric = "error", value = err)) return(list(metric = "error", value = err))
} }
@ -34,13 +35,13 @@ test_that("custom objective works", {
logregobjattr <- function(preds, dtrain) { logregobjattr <- function(preds, dtrain) {
labels <- attr(dtrain, 'label') labels <- attr(dtrain, 'label')
preds <- 1/(1 + exp(-preds)) preds <- 1 / (1 + exp(-preds))
grad <- preds - labels grad <- preds - labels
hess <- preds * (1 - preds) hess <- preds * (1 - preds)
return(list(grad = grad, hess = hess)) return(list(grad = grad, hess = hess))
} }
param <- list(max.depth=2, eta=1, nthread = 2, silent=1, param <- list(max.depth=2, eta=1, nthread = 2, silent = 1,
objective=logregobjattr, eval_metric=evalerror) objective = logregobjattr, eval_metric = evalerror)
bst <- xgb.train(param, dtrain, num_round, watchlist) bst <- xgb.train(param, dtrain, num_round, watchlist)
expect_equal(class(bst), "xgb.Booster") expect_equal(class(bst), "xgb.Booster")
expect_equal(length(bst$raw), 1064) expect_equal(length(bst$raw), 1064)

View File

@ -5,28 +5,64 @@ require(data.table)
require(Matrix) require(Matrix)
require(vcd) require(vcd)
set.seed(1982)
data(Arthritis) data(Arthritis)
data(agaricus.train, package='xgboost') data(agaricus.train, package='xgboost')
df <- data.table(Arthritis, keep.rownames = F) df <- data.table(Arthritis, keep.rownames = F)
df[,AgeDiscret:= as.factor(round(Age/10,0))] df[,AgeDiscret := as.factor(round(Age / 10,0))]
df[,AgeCat:= as.factor(ifelse(Age > 30, "Old", "Young"))] df[,AgeCat := as.factor(ifelse(Age > 30, "Old", "Young"))]
df[,ID:=NULL] df[,ID := NULL]
sparse_matrix = sparse.model.matrix(Improved~.-1, data = df) sparse_matrix <- sparse.model.matrix(Improved~.-1, data = df)
output_vector = df[,Y:=0][Improved == "Marked",Y:=1][,Y] output_vector <- df[,Y := 0][Improved == "Marked",Y := 1][,Y]
bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 9, bst.Tree <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 9,
eta = 1, nthread = 2, nround = 10,objective = "binary:logistic") eta = 1, nthread = 2, nround = 10, objective = "binary:logistic", booster = "gbtree")
bst.GLM <- xgboost(data = sparse_matrix, label = output_vector,
eta = 1, nthread = 2, nround = 10, objective = "binary:logistic", booster = "gblinear")
feature.names <- agaricus.train$data@Dimnames[[2]]
test_that("xgb.dump works", { test_that("xgb.dump works", {
capture.output(print(xgb.dump(bst))) capture.output(print(xgb.dump(bst.Tree)))
capture.output(print(xgb.dump(bst.GLM)))
expect_true(xgb.dump(bst.Tree, 'xgb.model.dump', with.stats = T))
}) })
test_that("xgb.importance works", { test_that("xgb.model.dt.tree works with and without feature names", {
xgb.dump(bst, 'xgb.model.dump', with.stats = T) names.dt.trees <- c("ID", "Feature", "Split", "Yes", "No", "Missing", "Quality", "Cover",
importance <- xgb.importance(sparse_matrix@Dimnames[[2]], 'xgb.model.dump') "Tree", "Yes.Feature", "Yes.Cover", "Yes.Quality", "No.Feature", "No.Cover", "No.Quality")
expect_equal(dim(importance), c(7, 4)) dt.tree <- xgb.model.dt.tree(feature_names = feature.names, model = bst.Tree)
expect_equal(names.dt.trees, names(dt.tree))
expect_equal(dim(dt.tree), c(162, 15))
xgb.model.dt.tree(model = bst.Tree)
}) })
test_that("xgb.plot.tree works", { test_that("xgb.importance works with and without feature names", {
xgb.plot.tree(agaricus.train$data@Dimnames[[2]], model = bst) importance.Tree <- xgb.importance(feature_names = sparse_matrix@Dimnames[[2]], model = bst.Tree)
expect_equal(dim(importance.Tree), c(7, 4))
expect_equal(colnames(importance.Tree), c("Feature", "Gain", "Cover", "Frequency"))
xgb.importance(model = bst.Tree)
xgb.plot.importance(importance_matrix = importance.Tree)
})
test_that("xgb.importance works with GLM model", {
importance.GLM <- xgb.importance(feature_names = sparse_matrix@Dimnames[[2]], model = bst.GLM)
expect_equal(dim(importance.GLM), c(10, 2))
expect_equal(colnames(importance.GLM), c("Feature", "Weight"))
xgb.importance(model = bst.GLM)
xgb.plot.importance(importance.GLM)
})
test_that("xgb.plot.tree works with and without feature names", {
xgb.plot.tree(feature_names = feature.names, model = bst.Tree)
xgb.plot.tree(model = bst.Tree)
})
test_that("xgb.plot.multi.trees works with and without feature names", {
xgb.plot.multi.trees(model = bst.Tree, feature_names = feature.names, features.keep = 3)
xgb.plot.multi.trees(model = bst.Tree, features.keep = 3)
})
test_that("xgb.plot.deepness works", {
xgb.plot.deepness(model = bst.Tree)
}) })

View File

@ -0,0 +1,27 @@
context("Code is of high quality and lint free")
test_that("Code Lint", {
skip_on_cran()
skip_on_travis()
skip_if_not_installed("lintr")
my_linters <- list(
absolute_paths_linter=lintr::absolute_paths_linter,
assignment_linter=lintr::assignment_linter,
closed_curly_linter=lintr::closed_curly_linter,
commas_linter=lintr::commas_linter,
# commented_code_linter=lintr::commented_code_linter,
infix_spaces_linter=lintr::infix_spaces_linter,
line_length_linter=lintr::line_length_linter,
no_tab_linter=lintr::no_tab_linter,
object_usage_linter=lintr::object_usage_linter,
# snake_case_linter=lintr::snake_case_linter,
# multiple_dots_linter=lintr::multiple_dots_linter,
object_length_linter=lintr::object_length_linter,
open_curly_linter=lintr::open_curly_linter,
# single_quotes_linter=lintr::single_quotes_linter,
spaces_inside_linter=lintr::spaces_inside_linter,
spaces_left_parentheses_linter=lintr::spaces_left_parentheses_linter,
trailing_blank_lines_linter=lintr::trailing_blank_lines_linter,
trailing_whitespace_linter=lintr::trailing_whitespace_linter
)
# lintr::expect_lint_free(linters=my_linters) # uncomment this if you want to check code quality
})

View File

@ -0,0 +1,32 @@
context('Test model params and call are exposed to R')
require(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
bst <- xgboost(data = dtrain,
max.depth = 2,
eta = 1,
nround = 10,
nthread = 1,
verbose = 0,
objective = "binary:logistic")
test_that("call is exposed to R", {
model_call <- attr(bst, "call")
expect_is(model_call, "call")
})
test_that("params is exposed to R", {
model_params <- attr(bst, "params")
expect_is(model_params, "list")
expect_equal(model_params$eta, 1)
expect_equal(model_params$max.depth, 2)
expect_equal(model_params$objective, "binary:logistic")
})

View File

@ -1,13 +1,14 @@
context('Test poisson regression model') context('Test poisson regression model')
require(xgboost) require(xgboost)
set.seed(1994)
test_that("poisson regression works", { test_that("poisson regression works", {
data(mtcars) data(mtcars)
bst = xgboost(data=as.matrix(mtcars[,-11]),label=mtcars[,11], bst <- xgboost(data = as.matrix(mtcars[,-11]),label = mtcars[,11],
objective='count:poisson',nrounds=5) objective = 'count:poisson', nrounds=5)
expect_equal(class(bst), "xgb.Booster") expect_equal(class(bst), "xgb.Booster")
pred = predict(bst,as.matrix(mtcars[,-11])) pred <- predict(bst,as.matrix(mtcars[, -11]))
expect_equal(length(pred), 32) expect_equal(length(pred), 32)
sqrt(mean((pred-mtcars[,11])^2)) expect_equal(sqrt(mean( (pred - mtcars[,11]) ^ 2)), 1.16, tolerance = 0.01)
}) })

View File

@ -190,7 +190,7 @@ Measure feature importance
In the code below, `sparse_matrix@Dimnames[[2]]` represents the column names of the sparse matrix. These names are the original values of the features (remember, each binary column == one value of one *categorical* feature). In the code below, `sparse_matrix@Dimnames[[2]]` represents the column names of the sparse matrix. These names are the original values of the features (remember, each binary column == one value of one *categorical* feature).
```{r} ```{r}
importance <- xgb.importance(sparse_matrix@Dimnames[[2]], model = bst) importance <- xgb.importance(feature_names = sparse_matrix@Dimnames[[2]], model = bst)
head(importance) head(importance)
``` ```
@ -202,7 +202,7 @@ head(importance)
`Cover` measures the relative quantity of observations concerned by a feature. `Cover` measures the relative quantity of observations concerned by a feature.
`Frequence` is a simpler way to measure the `Gain`. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it). `Frequency` is a simpler way to measure the `Gain`. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it).
### Improvement in the interpretability of feature importance data.table ### Improvement in the interpretability of feature importance data.table
@ -213,10 +213,10 @@ One simple solution is to count the co-occurrences of a feature and a class of t
For that purpose we will execute the same function as above but using two more parameters, `data` and `label`. For that purpose we will execute the same function as above but using two more parameters, `data` and `label`.
```{r} ```{r}
importanceRaw <- xgb.importance(sparse_matrix@Dimnames[[2]], model = bst, data = sparse_matrix, label = output_vector) importanceRaw <- xgb.importance(feature_names = sparse_matrix@Dimnames[[2]], model = bst, data = sparse_matrix, label = output_vector)
# Cleaning for better display # Cleaning for better display
importanceClean <- importanceRaw[,`:=`(Cover=NULL, Frequence=NULL)] importanceClean <- importanceRaw[,`:=`(Cover=NULL, Frequency=NULL)]
head(importanceClean) head(importanceClean)
``` ```

View File

@ -345,7 +345,7 @@ Feature importance is similar to R gbm package's relative influence (rel.inf).
``` ```
importance_matrix <- xgb.importance(model = bst) importance_matrix <- xgb.importance(model = bst)
print(importance_matrix) print(importance_matrix)
xgb.plot.importance(importance_matrix) xgb.plot.importance(importance_matrix = importance_matrix)
``` ```
View the trees from a model View the trees from a model

View File

@ -2,7 +2,9 @@
=========== ===========
[![Build Status](https://travis-ci.org/dmlc/xgboost.svg?branch=master)](https://travis-ci.org/dmlc/xgboost) [![Build Status](https://travis-ci.org/dmlc/xgboost.svg?branch=master)](https://travis-ci.org/dmlc/xgboost)
[![Documentation Status](https://readthedocs.org/projects/xgboost/badge/?version=latest)](https://xgboost.readthedocs.org) [![Documentation Status](https://readthedocs.org/projects/xgboost/badge/?version=latest)](https://xgboost.readthedocs.org)
[![GitHub license](http://dmlc.github.io/img/apache2.svg)](./LICENSE)
[![CRAN Status Badge](http://www.r-pkg.org/badges/version/xgboost)](http://cran.r-project.org/web/packages/xgboost) [![CRAN Status Badge](http://www.r-pkg.org/badges/version/xgboost)](http://cran.r-project.org/web/packages/xgboost)
[![PyPI version](https://badge.fury.io/py/xgboost.svg)](https://pypi.python.org/pypi/xgboost/)
[![Gitter chat for developers at https://gitter.im/dmlc/xgboost](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/dmlc/xgboost?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge) [![Gitter chat for developers at https://gitter.im/dmlc/xgboost](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/dmlc/xgboost?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
An optimized general purpose gradient boosting library. The library is parallelized, and also provides an optimized distributed version. An optimized general purpose gradient boosting library. The library is parallelized, and also provides an optimized distributed version.
@ -29,6 +31,9 @@ Contents
What's New What's New
---------- ----------
* XGBoost helps Vlad Mironov, Alexander Guschin to win the [CERN LHCb experiment Flavour of Physics competition](https://www.kaggle.com/c/flavours-of-physics). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/11/30/flavour-of-physics-technical-write-up-1st-place-go-polar-bears/).
* XGBoost helps Mario Filho, Josef Feigl, Lucas, Gilberto to win the [Caterpillar Tube Pricing competition](https://www.kaggle.com/c/caterpillar-tube-pricing). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/09/22/caterpillar-winners-interview-1st-place-gilberto-josef-leustagos-mario/).
* XGBoost helps Halla Yang to win the [Recruit Coupon Purchase Prediction Challenge](https://www.kaggle.com/c/coupon-purchase-prediction). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/10/21/recruit-coupon-purchase-winners-interview-2nd-place-halla-yang/).
* XGBoost helps Owen Zhang to win the [Avito Context Ad Click competition](https://www.kaggle.com/c/avito-context-ad-clicks). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/08/26/avito-winners-interview-1st-place-owen-zhang/). * XGBoost helps Owen Zhang to win the [Avito Context Ad Click competition](https://www.kaggle.com/c/avito-context-ad-clicks). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/08/26/avito-winners-interview-1st-place-owen-zhang/).
* XGBoost helps Chenglong Chen to win [Kaggle CrowdFlower Competition](https://www.kaggle.com/c/crowdflower-search-relevance) * XGBoost helps Chenglong Chen to win [Kaggle CrowdFlower Competition](https://www.kaggle.com/c/crowdflower-search-relevance)
Check out the [winning solution](https://github.com/ChenglongChen/Kaggle_CrowdFlower) Check out the [winning solution](https://github.com/ChenglongChen/Kaggle_CrowdFlower)

View File

@ -22,8 +22,8 @@ This is a list of short codes introducing different functionalities of xgboost p
[Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/boost_from_prediction.jl) [Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/boost_from_prediction.jl)
* Predicting using first n trees * Predicting using first n trees
[python](guide-python/predict_first_ntree.py) [python](guide-python/predict_first_ntree.py)
[R](../R-package/demo/boost_from_prediction.R) [R](../R-package/demo/predict_first_ntree.R)
[Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/boost_from_prediction.jl) [Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/predict_first_ntree.jl)
* Generalized Linear Model * Generalized Linear Model
[python](guide-python/generalized_linear_model.py) [python](guide-python/generalized_linear_model.py)
[R](../R-package/demo/generalized_linear_model.R) [R](../R-package/demo/generalized_linear_model.R)
@ -49,4 +49,3 @@ Benchmarks
---------- ----------
* [Starter script for Kaggle Higgs Boson](kaggle-higgs) * [Starter script for Kaggle Higgs Boson](kaggle-higgs)
* [Kaggle Tradeshift winning solution by daxiongshu](https://github.com/daxiongshu/kaggle-tradeshift-winning-solution) * [Kaggle Tradeshift winning solution by daxiongshu](https://github.com/daxiongshu/kaggle-tradeshift-winning-solution)

View File

@ -9,4 +9,6 @@ XGBoost Python Feature Walkthrough
* [Predicting leaf indices](predict_leaf_indices.py) * [Predicting leaf indices](predict_leaf_indices.py)
* [Sklearn Wrapper](sklearn_examples.py) * [Sklearn Wrapper](sklearn_examples.py)
* [Sklearn Parallel](sklearn_parallel.py) * [Sklearn Parallel](sklearn_parallel.py)
* [Sklearn access evals result](sklearn_evals_result.py)
* [Access evals result](evals_result.py)
* [External Memory](external_memory.py) * [External Memory](external_memory.py)

View File

@ -0,0 +1,30 @@
##
# This script demonstrates how to access the eval metrics in xgboost
##
import xgboost as xgb
dtrain = xgb.DMatrix('../data/agaricus.txt.train', silent=True)
dtest = xgb.DMatrix('../data/agaricus.txt.test', silent=True)
param = [('max_depth', 2), ('objective', 'binary:logistic'), ('eval_metric', 'logloss'), ('eval_metric', 'error')]
num_round = 2
watchlist = [(dtest,'eval'), (dtrain,'train')]
evals_result = {}
bst = xgb.train(param, dtrain, num_round, watchlist, evals_result=evals_result)
print('Access logloss metric directly from evals_result:')
print(evals_result['eval']['logloss'])
print('')
print('Access metrics through a loop:')
for e_name, e_mtrs in evals_result.items():
print('- {}'.format(e_name))
for e_mtr_name, e_mtr_vals in e_mtrs.items():
print(' - {}'.format(e_mtr_name))
print(' - {}'.format(e_mtr_vals))
print('')
print('Access complete dictionary:')
print(evals_result)

View File

@ -0,0 +1,43 @@
##
# This script demonstrates how to access the xgboost eval metrics using the sklearn interface
##
import xgboost as xgb
import numpy as np
from sklearn.datasets import make_hastie_10_2
X, y = make_hastie_10_2(n_samples=2000, random_state=42)
# Map labels from {-1, 1} to {0, 1}
labels, y = np.unique(y, return_inverse=True)
X_train, X_test = X[:1600], X[1600:]
y_train, y_test = y[:1600], y[1600:]
param_dist = {'objective':'binary:logistic', 'n_estimators':2}
clf = xgb.XGBModel(**param_dist)
# Or you can use: clf = xgb.XGBClassifier(**param_dist)
clf.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_test, y_test)],
eval_metric='logloss',
verbose=True)
# Load evals result by calling the evals_result() function
evals_result = clf.evals_result()
print('Access logloss metric directly from validation_0:')
print(evals_result['validation_0']['logloss'])
print('')
print('Access metrics through a loop:')
for e_name, e_mtrs in evals_result.items():
print('- {}'.format(e_name))
for e_mtr_name, e_mtr_vals in e_mtrs.items():
print(' - {}'.format(e_mtr_name))
print(' - {}'.format(e_mtr_vals))
print('')
print('Access complete dict:')
print(evals_result)

View File

@ -15,66 +15,17 @@ Build XGBoost in OS X with OpenMP
--------------------------------- ---------------------------------
Here is the complete solution to use OpenMp-enabled compilers to install XGBoost. Here is the complete solution to use OpenMp-enabled compilers to install XGBoost.
1. Obtain gcc with openmp support by `brew install gcc --without-multilib` **or** clang with openmp by `brew install clang-omp`. The clang one is recommended because the first method requires us compiling gcc inside the machine (more than an hour in mine)! (BTW, `brew` is the de facto standard of `apt-get` on OS X. So installing [HPC](http://hpc.sourceforge.net/) separately is not recommended, but it should work.) 1. Obtain gcc-5.x.x with openmp support by `brew install gcc --without-multilib`. (`brew` is the de facto standard of `apt-get` on OS X. So installing [HPC](http://hpc.sourceforge.net/) separately is not recommended, but it should work.)
2. **if you are planing to use clang-omp** - in step 3 and/or 4, change line 9 in `xgboost/src/utils/omp.h` to 2. `cd xgboost` then `bash build.sh` to compile XGBoost.
```C++ 3. Install xgboost package for Python and R
#include <libiomp/omp.h> /* instead of #include <omp.h> */`
```
to make it work, otherwise you might get this error - For Python: go to `python-package` sub-folder to install python version with `python setup.py install` (or `sudo python setup.py install`).
- For R: Set the `Makevars` file with the highest priority for R.
`src/tree/../utils/omp.h:9:10: error: 'omp.h' file not found...`
3. Set the `Makefile` correctly for compiling cpp version xgboost then python version xgboost.
```Makefile
export CC = gcc-4.9
export CXX = g++-4.9
```
Or
```Makefile
export CC = clang-omp
export CXX = clang-omp++
```
Remember to change `header` (mentioned in step 2) if using clang-omp.
Then `cd xgboost` then `bash build.sh` to compile XGBoost. And go to `wrapper` sub-folder to install python version.
4. Set the `Makevars` file in highest piority for R.
The point is, there are three `Makevars` : `~/.R/Makevars`, `xgboost/R-package/src/Makevars`, and `/usr/local/Cellar/r/3.2.0/R.framework/Resources/etc/Makeconf` (the last one obtained by running `file.path(R.home("etc"), "Makeconf")` in R), and `SHLIB_OPENMP_CXXFLAGS` is not set by default!! After trying, it seems that the first one has the highest priority (surprise!). The point is, there are three `Makevars` : `~/.R/Makevars`, `xgboost/R-package/src/Makevars`, and `/usr/local/Cellar/r/3.2.0/R.framework/Resources/etc/Makeconf` (the last one obtained by running `file.path(R.home("etc"), "Makeconf")` in R), and `SHLIB_OPENMP_CXXFLAGS` is not set by default!! After trying, it seems that the first one has the highest priority (surprise!).
So, **add** or **change** `~/.R/Makevars` to the following lines:
```Makefile
CC=gcc-4.9
CXX=g++-4.9
SHLIB_OPENMP_CFLAGS = -fopenmp
SHLIB_OPENMP_CXXFLAGS = -fopenmp
SHLIB_OPENMP_FCFLAGS = -fopenmp
SHLIB_OPENMP_FFLAGS = -fopenmp
```
Or
```Makefile
CC=clang-omp
CXX=clang-omp++
SHLIB_OPENMP_CFLAGS = -fopenmp
SHLIB_OPENMP_CXXFLAGS = -fopenmp
SHLIB_OPENMP_FCFLAGS = -fopenmp
SHLIB_OPENMP_FFLAGS = -fopenmp
```
Again, remember to change `header` if using clang-omp.
Then inside R, run Then inside R, run
```R ```R

View File

@ -11,7 +11,7 @@ This document is hosted at http://xgboost.readthedocs.org/. You can also browse
How to Get Started How to Get Started
------------------ ------------------
The best way to get started learning xgboost is through the examples. There are three types of examples you can find in xgboost. The best way to get started learning xgboost is through the examples. There are three types of examples you can find in xgboost.
* [Tutorials](#tutorials) are self-conatained tutorials on a complete data science tasks. * [Tutorials](#tutorials) are self-contained tutorials on complete data science tasks.
* [XGBoost Code Examples](../demo/) are collections of code and benchmarks of xgboost. * [XGBoost Code Examples](../demo/) are collections of code and benchmarks of xgboost.
- These include a walkthrough section that guides you through specific API features. - These include a walkthrough section that guides you through specific API features.
* [Highlight Solutions](#highlight-solutions) are presentations using xgboost to solve real world problems. * [Highlight Solutions](#highlight-solutions) are presentations using xgboost to solve real world problems.

View File

@ -1,8 +1,12 @@
Introduction to Boosted Trees Introduction to Boosted Trees
============================= =============================
XGBoost is short for "Extreme Gradient Boosting", where the term "Gradient Boosting" is proposed in the paper _Greedy Function Approximation: A Gradient Boosting Machine_, Friedman. Based on this original model. This is a tutorial on boosted trees, most of content are based on this [slide](http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf) by the author of xgboost. XGBoost is short for "Extreme Gradient Boosting", where the term "Gradient Boosting" is proposed in the paper _Greedy Function Approximation: A Gradient Boosting Machine_, by Friedman.
XGBoost is based on this original model.
This is a tutorial on gradient boosted trees, and most of the content is based on these [slides](http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf) by the author of xgboost.
The GBM(boosted trees) has been around for really a while, and there are a lot of materials on the topic. This tutorial tries to explain boosted trees in a self-contained and principled way of supervised learning. We think this explanation is cleaner, more formal, and motivates the variant used in xgboost. The GBM (boosted trees) has been around for quite a while now, and there is a lot of material on the topic.
This tutorial tries to explain boosted trees in a self-contained and principled way using the elements of supervised learning.
We think this explanation is cleaner, more formal, and motivates the variant used in xgboost.
Elements of Supervised Learning Elements of Supervised Learning
------------------------------- -------------------------------
@ -10,21 +14,21 @@ XGBoost is used for supervised learning problems, where we use the training data
Before we dive into trees, let us start by reviewing the basic elements in supervised learning. Before we dive into trees, let us start by reviewing the basic elements in supervised learning.
### Model and Parameters ### Model and Parameters
The ***model*** in supervised learning usually refers to the mathematical structure on how to given the prediction ``$ y_i $`` given ``$ x_i $``. The ***model*** in supervised learning usually refers to the mathematical structure of how to make the prediction ``$ y_i $`` given ``$ x_i $``.
For example, a common model is *linear model*, where the prediction is given by ``$ \hat{y}_i = \sum_j w_j x_{ij} $``, a linear combination of weighted input features. For example, a common model is a *linear model*, where the prediction is given by ``$ \hat{y}_i = \sum_j w_j x_{ij} $``, a linear combination of weighted input features.
The prediction value can have different interpretations, depending on the task. The prediction value can have different interpretations, depending on the task.
For example, it can be logistic transformed to get the probability of positive class in logistic regression, and it can also be used as ranking score when we want to rank the outputs. For example, it can be logistic transformed to get the probability of positive class in logistic regression, and it can also be used as a ranking score when we want to rank the outputs.
The ***parameters*** are the undermined part that we need to learn from data. In linear regression problem, the parameters are the co-efficients ``$ w $``. The ***parameters*** are the undetermined part that we need to learn from data. In linear regression problems, the parameters are the coefficients ``$ w $``.
Usually we will use ``$ \Theta $`` to denote the parameters. Usually we will use ``$ \Theta $`` to denote the parameters.
### Objective Function : Training Loss + Regularization ### Objective Function : Training Loss + Regularization
Based on different understanding or assumption of ``$ y_i $``, we can have different problems as regression, classification, ordering, etc. Based on different understandings of ``$ y_i $`` we can have different problems, such as regression, classification, ordering, etc.
We need to find a way to find the best parameters given the training data. In order to do so, we need to define a so called ***objective function***, We need to find a way to find the best parameters given the training data. In order to do so, we need to define a so-called ***objective function***,
to measure the performance of the model under certain set of parameters. to measure the performance of the model given a certain set of parameters.
A very important fact about objective functions, is they ***must always*** contains two parts: training loss and regularization. A very important fact about objective functions is they ***must always*** contain two parts: training loss and regularization.
```math ```math
Obj(\Theta) = L(\Theta) + \Omega(\Theta) Obj(\Theta) = L(\Theta) + \Omega(\Theta)
@ -44,7 +48,8 @@ L(\theta) = \sum_i[ y_i\ln (1+e^{-\hat{y}_i}) + (1-y_i)\ln (1+e^{\hat{y}_i})]
The ***regularization term*** is what people usually forget to add. The regularization term controls the complexity of the model, which helps us to avoid overfitting. The ***regularization term*** is what people usually forget to add. The regularization term controls the complexity of the model, which helps us to avoid overfitting.
This sounds a bit abstract, so let us consider the following problem in the following picture. You are asked to *fit* visually a step function given the input data points This sounds a bit abstract, so let us consider the following problem in the following picture. You are asked to *fit* visually a step function given the input data points
on the upper left corner of the image, which solution among the tree you think is the best fit? on the upper left corner of the image.
Which solution among the three do you think is the best fit?
![Step function](img/step_fit.png) ![Step function](img/step_fit.png)
@ -53,26 +58,26 @@ The tradeoff between the two is also referred as bias-variance tradeoff in machi
### Why introduce the general principle ### Why introduce the general principle
The elements introduced in above forms the basic elements of supervised learning, and they are naturally the building blocks of machine learning toolkits. The elements introduced above form the basic elements of supervised learning, and they are naturally the building blocks of machine learning toolkits.
For example, you should be able to answer what is the difference and common parts between boosted trees and random forest. For example, you should be able to describe the differences and commonalities between boosted trees and random forests.
Understanding the process in a formalized way also helps us to understand the objective that we are learning and the reason behind the heurestics such as Understanding the process in a formalized way also helps us to understand the objective that we are learning and the reason behind the heuristics such as
pruning and smoothing. pruning and smoothing.
Tree Ensemble Tree Ensemble
------------- -------------
Now that we have introduced the elements of supervised learning, let us get started with real trees. Now that we have introduced the elements of supervised learning, let us get started with real trees.
To begin with, let us first learn what is the ***model*** of xgboost: tree ensembles. To begin with, let us first learn about the ***model*** of xgboost: tree ensembles.
The tree ensemble model is a set of classification and regression trees (CART). Here's a simple example of a CART The tree ensemble model is a set of classification and regression trees (CART). Here's a simple example of a CART
that classifies is someone will like computer games. that classifies whether someone will like computer games.
![CART](img/cart.png) ![CART](img/cart.png)
We classify the members in thie family into different leaves, and assign them the score on corresponding leaf. We classify the members of a family into different leaves, and assign them the score on the corresponding leaf.
A CART is a bit different from decision trees, where the leaf only contain decision values. In CART, a real score A CART is a bit different from decision trees, where the leaf only contains decision values. In CART, a real score
is associated with each of the leaves, which gives us richer interpretations that go beyond classification. is associated with each of the leaves, which gives us richer interpretations that go beyond classification.
This also makes the unified optimization step easier, as we will see in later part of this tutorial. This also makes the unified optimization step easier, as we will see in later part of this tutorial.
Usually, a single tree is not so strong enough to be used in practice. What is actually used is the so called Usually, a single tree is not strong enough to be used in practice. What is actually used is the so-called
tree ensemble model, that sums the prediction of multiple trees together. tree ensemble model, that sums the prediction of multiple trees together.
![TwoCART](img/twocart.png) ![TwoCART](img/twocart.png)
@ -90,9 +95,9 @@ where ``$ K $`` is the number of trees, ``$ f $`` is a function in the functiona
```math ```math
obj(\Theta) = \sum_i^n l(y_i, \hat{y}_i) + \sum_{k=1}^K \Omega(f_k) obj(\Theta) = \sum_i^n l(y_i, \hat{y}_i) + \sum_{k=1}^K \Omega(f_k)
``` ```
Now here comes the question, what is the *model* of random forest? It is exactly tree ensembles! So random forest and boosted trees are not different in terms of model, Now here comes the question, what is the *model* for random forests? It is exactly tree ensembles! So random forests and boosted trees are not different in terms of model,
the difference is how we train them. This means if you write a predictive service of tree ensembles, you only need to write one of them and they should directly work the difference is how we train them. This means if you write a predictive service of tree ensembles, you only need to write one of them and they should directly work
for both random forest and boosted trees. One example of elements of supervised learning rocks. for both random forests and boosted trees. One example of why the elements of supervised learning rock.
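To make this concrete, here is a minimal sketch (illustration only, not xgboost code, using hypothetical node dictionaries) of a generic tree-ensemble predictor; the same function serves a random forest or a boosted model, because both just sum the outputs of their trees.

```python
# A tree is a hypothetical nested dict: a leaf {'leaf': w} or a split
# {'feature': j, 'threshold': t, 'left': <node>, 'right': <node>}.

def predict_tree(node, x):
    """Route one example x (a sequence of feature values) down a single tree."""
    while 'leaf' not in node:
        branch = 'left' if x[node['feature']] < node['threshold'] else 'right'
        node = node[branch]
    return node['leaf']

def predict_ensemble(trees, x):
    """Tree-ensemble prediction: y_hat = sum over k of f_k(x)."""
    return sum(predict_tree(tree, x) for tree in trees)
```

Whether `trees` were produced by boosting or by bagging for a random forest, this prediction code is identical; only the training procedure differs.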
Tree Boosting Tree Boosting
------------- -------------
@ -106,10 +111,11 @@ Obj = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\Omega(f_i) \\
### Additive Training ### Additive Training
First thing we want to ask is what are ***parameters*** of trees. You can find what we need to learn are those functions ``$f_i$``, with each contains the structure First thing we want to ask is what are the ***parameters*** of trees.
of the tree, and the leaf score. This is much harder than traditional optimization problem where you can take the gradient and go. You can find what we need to learn are those functions ``$f_i$``, with each containing the structure
of the tree and the leaf scores. This is much harder than traditional optimization problem where you can take the gradient and go.
It is not easy to train all the trees at once. It is not easy to train all the trees at once.
Instead, we use an additive strategy: fix what we have learned, add a new tree at a time. Instead, we use an additive strategy: fix what we have learned, add one new tree at a time.
We note the prediction value at step ``$t$`` by ``$ \hat{y}_i^{(t)}$``, so we have We note the prediction value at step ``$t$`` by ``$ \hat{y}_i^{(t)}$``, so we have
```math ```math
@ -120,7 +126,7 @@ We note the prediction value at step ``$t$`` by ``$ \hat{y}_i^{(t)}$``, so we ha
\hat{y}_i^{(t)} &= \sum_{k=1}^t f_k(x_i)= \hat{y}_i^{(t-1)} + f_t(x_i) \hat{y}_i^{(t)} &= \sum_{k=1}^t f_k(x_i)= \hat{y}_i^{(t-1)} + f_t(x_i)
``` ```
It remains to ask Which tree do we want at each step? A natural thing is to add the one that optimizes our objective. It remains to ask, which tree do we want at each step? A natural thing is to add the one that optimizes our objective.
```math ```math
Obj^{(t)} & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\Omega(f_i) \\ Obj^{(t)} & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\Omega(f_i) \\
@ -135,8 +141,8 @@ Obj^{(t)} & = \sum_{i=1}^n (y_i - (\hat{y}_i^{(t-1)} + f_t(x_i)))^2 + \sum_{i=1}
``` ```
The form of MSE is friendly, with a first order term (usually called residual) and a quadratic term. The form of MSE is friendly, with a first order term (usually called residual) and a quadratic term.
For other loss of interest (for example, logistic loss), it is not so easy to get such a nice form. For other losses of interest (for example, logistic loss), it is not so easy to get such a nice form.
So in general case, we take the Taylor expansion of the loss function up to the second order So in the general case, we take the Taylor expansion of the loss function up to the second order
```math ```math
Obj^{(t)} = \sum_{i=1}^n [l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t) + constant Obj^{(t)} = \sum_{i=1}^n [l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t) + constant
@ -148,15 +154,15 @@ g_i &= \partial_{\hat{y}_i^{(t)}} l(y_i, \hat{y}_i^{(t-1)})\\
h_i &= \partial_{\hat{y}_i^{(t)}}^2 l(y_i, \hat{y}_i^{(t-1)}) h_i &= \partial_{\hat{y}_i^{(t)}}^2 l(y_i, \hat{y}_i^{(t-1)})
``` ```
After we remove all the constants, the specific objective at t step becomes After we remove all the constants, the specific objective at step ``$t$`` becomes
```math ```math
\sum_{i=1}^n [g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t) \sum_{i=1}^n [g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t)
``` ```
This becomes our optimization goal for the new tree. One important advantage of this definition, is that This becomes our optimization goal for the new tree. One important advantage of this definition is that
it only depends on ``$g_i$`` and ``$h_i$``, this is how xgboost allows support of customization of loss functions. it only depends on ``$g_i$`` and ``$h_i$``. This is how xgboost can support custom loss functions.
We can optimized every loss function, including logistic regression, weighted logistic regression, using the exactly We can optimize every loss function, including logistic regression and weighted logistic regression, using exactly
the same solver that takes ``$g_i$`` and ``$h_i$`` as input! the same solver that takes ``$g_i$`` and ``$h_i$`` as input!
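To illustrate, the python package exposes this through a custom objective callback passed to `xgb.train` (the `obj` argument, as used in the repository's custom objective demo). A hedged sketch for logistic loss, with a made-up random dataset just to make it runnable, could look like this:

```python
import numpy as np
import xgboost as xgb

def logregobj(preds, dtrain):
    """Return per-example g_i and h_i for the logistic loss."""
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))   # raw margin -> probability
    grad = preds - labels                  # g_i: first derivative of the loss
    hess = preds * (1.0 - preds)           # h_i: second derivative of the loss
    return grad, hess

# Made-up data for illustration
data = np.random.rand(100, 5)
label = np.random.randint(2, size=100)
dtrain = xgb.DMatrix(data, label=label)

bst = xgb.train({'max_depth': 2, 'eta': 1, 'silent': 1}, dtrain,
                num_boost_round=2, obj=logregobj)
```

The booster never needs to know which loss it is minimizing; it only consumes the `grad` and `hess` arrays.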
### Model Complexity ### Model Complexity
@ -173,9 +179,9 @@ In XGBoost, we define the complexity as
```math ```math
\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2 \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2
``` ```
Of course there is more than one way to define the complexity, but this specific one works well in practice. The regularization is one part most tree packages takes Of course there is more than one way to define the complexity, but this specific one works well in practice. The regularization is one part most tree packages treat
less carefully, or simply ignore. This was due to the traditional treatment tree learning only emphasize improving impurity, while the complexity control part less carefully, or simply ignore. This was because the traditional treatment of tree learning only emphasized improving impurity, while the complexity control was left to heuristics.
are more lies as part of heuristics. By defining it formally, we can get a better idea of what we are learning, and yes it works well in practice. By defining it formally, we can get a better idea of what we are learning, and yes it works well in practice.
### The Structure Score ### The Structure Score
@ -186,13 +192,15 @@ Obj^{(t)} &\approx \sum_{i=1}^n [g_i w_q(x_i) + \frac{1}{2} h_i w_{q(x_i)}^2] +
&= \sum^T_{j=1} [(\sum_{i\in I_j} g_i) w_j + \frac{1}{2} (\sum_{i\in I_j} h_i + \lambda) w_j^2 ] + \gamma T &= \sum^T_{j=1} [(\sum_{i\in I_j} g_i) w_j + \frac{1}{2} (\sum_{i\in I_j} h_i + \lambda) w_j^2 ] + \gamma T
``` ```
where ``$ I_j = \{i|q(x_i)=j\} $`` is the set of indices of data points assigned to the ``$ j $``-th leaf. Notice that in the second line we have change the index of the summation because all the data points on the same leaf get the same score. We could further compress the expression by defining ``$ G_j = \sum_{i\in I_j} g_i $`` and ``$ H_j = \sum_{i\in I_j} h_i $``: where ``$ I_j = \{i|q(x_i)=j\} $`` is the set of indices of data points assigned to the ``$ j $``-th leaf.
Notice that in the second line we have changed the index of the summation because all the data points on the same leaf get the same score.
We could further compress the expression by defining ``$ G_j = \sum_{i\in I_j} g_i $`` and ``$ H_j = \sum_{i\in I_j} h_i $``:
```math ```math
Obj^{(t)} = \sum^T_{j=1} [G_jw_j + \frac{1}{2} (H_j+\lambda) w_j^2] +\gamma T Obj^{(t)} = \sum^T_{j=1} [G_jw_j + \frac{1}{2} (H_j+\lambda) w_j^2] +\gamma T
``` ```
In this equation ``$ w_j $`` are independent to each other, the form ``$ G_jw_j+\frac{1}{2}(H_j+\lambda)w_j^2 $`` is quadratic and the best ``$ w_j $`` for a given structure ``$q(x)$`` and the best objective reduction we can get: In this equation the ``$ w_j $`` are independent of each other, the form ``$ G_jw_j+\frac{1}{2}(H_j+\lambda)w_j^2 $`` is quadratic, and the best ``$ w_j $`` for a given structure ``$q(x)$`` and the best objective reduction we can get are:
```math ```math
w_j^\ast = -\frac{G_j}{H_j+\lambda}\\ w_j^\ast = -\frac{G_j}{H_j+\lambda}\\
@ -202,30 +210,31 @@ The last equation measures ***how good*** a tree structure ``$q(x)$`` is.
![Structure Score](img/struct_score.png) ![Structure Score](img/struct_score.png)
If all these sounds a bit complicated. Let us take a look the the picture, and see how the scores can be calculated. If all this sounds a bit complicated, let's take a look at the picture, and see how the scores can be calculated.
Basically, for a given tree structure, we push the statistics ``$g_i$`` and ``$h_i$`` to the leaves they belong to, Basically, for a given tree structure, we push the statistics ``$g_i$`` and ``$h_i$`` to the leaves they belong to,
sum the statistics together, and use the formula to calulate how good the tree is. sum the statistics together, and use the formula to calculate how good the tree is.
This score is like impurity measure in decision tree, except that it also takes the model complexity into account. This score is like the impurity measure in a decision tree, except that it also takes the model complexity into account.
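As a tiny numerical illustration (made-up statistics, not xgboost code), the sketch below evaluates the two formulas above for a fixed structure with two leaves:

```python
# Hypothetical per-leaf sums of the gradient statistics:
# G[j] = sum of g_i over leaf j, H[j] = sum of h_i over leaf j.
G = [-1.5, 2.0]
H = [1.0, 3.0]
lam, gamma = 1.0, 0.1        # regularization parameters lambda and gamma
T = len(G)                   # number of leaves in this structure

w_opt = [-G[j] / (H[j] + lam) for j in range(T)]                        # w_j* = -G_j / (H_j + lambda)
obj = -0.5 * sum(G[j]**2 / (H[j] + lam) for j in range(T)) + gamma * T  # structure score

print(w_opt)   # optimal leaf scores for this structure
print(obj)     # smaller is better
```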
### Learn the tree structure ### Learn the tree structure
Now we have a way to measure how good a tree is ideally we can enumerate all possible trees and pick the best one. Now that we have a way to measure how good a tree is, ideally we would enumerate all possible trees and pick the best one.
In practice it is impossible, so we will try to one level of the tree at a time. In practice it is intractable, so we will try to optimize one level of the tree at a time.
Specifically we try to split a leaf into two leaves, and the score it gains is Specifically we try to split a leaf into two leaves, and the score it gains is
```math ```math
Gain = \frac{1}{2} \left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma Gain = \frac{1}{2} \left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma
``` ```
This formula can be decomposited as 1) the score on the new left leaf 2) the score on the new right leaf 3) The score on the original leaf 4) regularization on the additional leaf. This formula can be decomposed as 1) the score on the new left leaf, 2) the score on the new right leaf, 3) the score on the original leaf, and 4) the regularization on the additional leaf.
We can find an important fact here: if the gain is smaller than ``$\gamma$``, we would better not to add that branch. This is exactly the ***prunning*** techniques in tree based We can see an important fact here: if the gain is smaller than ``$\gamma$``, we would do better not to add that branch. This is exactly the ***pruning*** technique in tree-based
models! By using the principles of supervised learning, we can naturally comes up with the reason these techniques :) models! By using the principles of supervised learning, we can naturally come up with the reason why these techniques work :)
For real valued data, we usually want to search for an optimal split. To efficiently do so, we place all the instances in a sorted way, like the following picture. For real valued data, we usually want to search for an optimal split. To efficiently do so, we place all the instances in sorted order, like the following picture.
![Best split](img/split_find.png) ![Best split](img/split_find.png)
Then a left to right scan is sufficient to calculate the structure score of all possible split solutions, and we can find the best split efficiently. Then a left to right scan is sufficient to calculate the structure score of all possible split solutions, and we can find the best split efficiently.
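A hedged sketch of that scan (illustration only, not the xgboost implementation): sort the instances by the feature value, maintain running sums ``$G_L$`` and ``$H_L$``, and evaluate the gain formula at every candidate split point.

```python
def best_split(values, grad, hess, lam=1.0, gamma=0.0):
    """One left-to-right scan over a single feature; returns (best_gain, threshold)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    G, H = sum(grad), sum(hess)          # statistics of the leaf being split
    G_L = H_L = 0.0
    best_gain, best_threshold = 0.0, None
    for i in order[:-1]:                 # the last position leaves the right side empty
        G_L += grad[i]
        H_L += hess[i]
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_threshold = gain, values[i]
        # a real implementation splits between consecutive distinct values
    return best_gain, best_threshold
```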
Final words on XGBoost Final words on XGBoost
---------------------- ----------------------
Now you have understand what is a boosted tree, you may ask, where is the introduction on [XGBoost](https://github.com/dmlc/xgboost)? Now that you understand what boosted trees are, you may ask, where is the introduction on [XGBoost](https://github.com/dmlc/xgboost)?
XGBoost is exactly a tool motivated by the formal principle introduced in this tutorial! XGBoost is exactly a tool motivated by the formal principle introduced in this tutorial!
More importantly, it is developed with both deep consideration in terms of ***systems optimization*** and ***principles in machine learning***. More importantly, it is developed with both deep consideration in terms of ***systems optimization*** and ***principles in machine learning***.
The goal of this library is to push the extreme of the computation limits of machines to provide a ***scalable***, ***portable*** and ***accurate*** library. The goal of this library is to push the extreme of the computation limits of machines to provide a ***scalable***, ***portable*** and ***accurate*** library.

View File

@ -1,6 +1,6 @@
XGBoost Parameters XGBoost Parameters
================== ==================
Before running XGboost, we must set three types of parameters, general parameters, booster parameters and task parameters: Before running XGBoost, we must set three types of parameters: general parameters, booster parameters and task parameters (a short example follows the list below).
- General parameters relate to which booster we are using to do boosting, commonly a tree or a linear model - General parameters relate to which booster we are using to do boosting, commonly a tree or a linear model
- Booster parameters depend on which booster you have chosen - Booster parameters depend on which booster you have chosen
- Learning task parameters decide on the learning scenario; for example, regression tasks may use different parameters than ranking tasks. - Learning task parameters decide on the learning scenario; for example, regression tasks may use different parameters than ranking tasks.
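As an illustration only (the parameter names here are ones used elsewhere in this documentation), a single python configuration typically mixes all three kinds:

```python
param = {
    'booster': 'gbtree',             # general parameter: which booster to use
    'max_depth': 3, 'eta': 0.1,      # booster parameters for the tree booster
    'objective': 'binary:logistic',  # learning task parameter
    'eval_metric': 'auc',            # learning task parameter
}
```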
@ -62,8 +62,8 @@ Parameters for Linear Booster
Learning Task Parameters Learning Task Parameters
------------------------ ------------------------
Specify the learning task and the corresponding learning objective. The objective options are below:
* objective [ default=reg:linear ] * objective [ default=reg:linear ]
- specify the learning task and the corresponding learning objective, and the objective options are below:
- "reg:linear" --linear regression - "reg:linear" --linear regression
- "reg:logistic" --logistic regression - "reg:logistic" --logistic regression
- "binary:logistic" --logistic regression for binary classification, output probability - "binary:logistic" --logistic regression for binary classification, output probability
@ -97,9 +97,9 @@ Command Line Parameters
----------------------- -----------------------
The following parameters are only used in the console version of xgboost The following parameters are only used in the console version of xgboost
* use_buffer [ default=1 ] * use_buffer [ default=1 ]
- whether create binary buffer for text input, this normally will speedup loading when do - Whether to create a binary buffer from text input. Doing so normally will speed up loading times
* num_round * num_round
- the number of round for boosting. - The number of rounds for boosting
* data * data
- The path of training data - The path of training data
* test:data * test:data

View File

@ -8,7 +8,7 @@ This document gives a basic walkthrough of xgboost python package.
Install XGBoost Install XGBoost
--------------- ---------------
To install XGBoost, do the following steps. To install XGBoost, do the following steps:
* You need to run `make` in the root directory of the project * You need to run `make` in the root directory of the project
* In the `python-package` directory run * In the `python-package` directory run
@ -22,34 +22,39 @@ import xgboost as xgb
Data Interface Data Interface
-------------- --------------
XGBoost python module is able to loading from libsvm txt format file, Numpy 2D array and xgboost binary buffer file. The data will be store in ```DMatrix``` object. The XGBoost python module is able to load data from:
- libsvm txt format file
- Numpy 2D array, and
- xgboost binary buffer file.
* To load libsvm text format file and XGBoost binary file into ```DMatrix```, the usage is like The data will be stored in a ```DMatrix``` object.
* To load a libsvm text file or an XGBoost binary file into ```DMatrix```, the command is:
```python ```python
dtrain = xgb.DMatrix('train.svm.txt') dtrain = xgb.DMatrix('train.svm.txt')
dtest = xgb.DMatrix('test.svm.buffer') dtest = xgb.DMatrix('test.svm.buffer')
``` ```
* To load numpy array into ```DMatrix```, the usage is like * To load a numpy array into ```DMatrix```, the command is:
```python ```python
data = np.random.rand(5,10) # 5 entities, each contains 10 features data = np.random.rand(5,10) # 5 entities, each contains 10 features
label = np.random.randint(2, size=5) # binary target label = np.random.randint(2, size=5) # binary target
dtrain = xgb.DMatrix( data, label=label) dtrain = xgb.DMatrix( data, label=label)
``` ```
* Build ```DMatrix``` from ```scipy.sparse``` * To load a scipy.sparse array into ```DMatrix```, the command is:
```python ```python
csr = scipy.sparse.csr_matrix((dat, (row, col))) csr = scipy.sparse.csr_matrix((dat, (row, col)))
dtrain = xgb.DMatrix(csr) dtrain = xgb.DMatrix(csr)
``` ```
* Saving ```DMatrix``` into XGBoost binary file will make loading faster in next time. The usage is like: * Saving ```DMatrix``` into an XGBoost binary file will make loading faster the next time:
```python ```python
dtrain = xgb.DMatrix('train.svm.txt') dtrain = xgb.DMatrix('train.svm.txt')
dtrain.save_binary("train.buffer") dtrain.save_binary("train.buffer")
``` ```
* To handle missing value in ```DMatrix```, you can initialize the ```DMatrix``` like: * To handle missing values in ```DMatrix```, you can initialize the ```DMatrix``` by specifying the value that represents missing data:
```python ```python
dtrain = xgb.DMatrix(data, label=label, missing = -999.0) dtrain = xgb.DMatrix(data, label=label, missing = -999.0)
``` ```
* Weight can be set when needed, like * Weight can be set when needed:
```python ```python
w = np.random.rand(5, 1) w = np.random.rand(5, 1)
dtrain = xgb.DMatrix(data, label=label, missing = -999.0, weight=w) dtrain = xgb.DMatrix(data, label=label, missing = -999.0, weight=w)
@ -62,10 +67,17 @@ XGBoost use list of pair to save [parameters](../parameter.md). Eg
```python ```python
param = {'bst:max_depth':2, 'bst:eta':1, 'silent':1, 'objective':'binary:logistic' } param = {'bst:max_depth':2, 'bst:eta':1, 'silent':1, 'objective':'binary:logistic' }
param['nthread'] = 4 param['nthread'] = 4
plst = param.items() param['eval_metric'] = 'auc'
plst += [('eval_metric', 'auc')] # Multiple evals can be handled in this way
plst += [('eval_metric', 'ams@0')]
``` ```
* You can also specify multiple eval metrics:
```python
param['eval_metric'] = ['auc', 'ams@0']
# alternatively:
# plst = param.items()
# plst += [('eval_metric', 'ams@0')]
```
* Specify validations set to watch performance * Specify validations set to watch performance
```python ```python
evallist = [(dtest,'eval'), (dtrain,'train')] evallist = [(dtest,'eval'), (dtrain,'train')]
@ -109,9 +121,9 @@ Early stopping requires at least one set in `evals`. If there's more than one, i
The model will train until the validation score stops improving. Validation error needs to decrease at least every `early_stopping_rounds` to continue training. The model will train until the validation score stops improving. Validation error needs to decrease at least every `early_stopping_rounds` to continue training.
If early stopping occurs, the model will have two additional fields: `bst.best_score` and `bst.best_iteration`. Note that `train()` will return a model from the last iteration, not the best one. If early stopping occurs, the model will have three additional fields: `bst.best_score`, `bst.best_iteration` and `bst.best_ntree_limit`. Note that `train()` will return a model from the last iteration, not the best one.
This works with both metrics to minimize (RMSE, log loss, etc.) and to maximize (MAP, NDCG, AUC). This works with both metrics to minimize (RMSE, log loss, etc.) and to maximize (MAP, NDCG, AUC). Note that if you specify more than one evaluation metric the last one in `param['eval_metric']` is used for early stopping.
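A hedged usage sketch, reusing the `param`, `dtrain`, `num_round` and `evallist` names from the earlier examples (the round count here is only illustrative):

```python
bst = xgb.train(param, dtrain, num_round, evallist,
                early_stopping_rounds=10)
# Training stops once the last metric in param['eval_metric'] has not improved
# on the last entry of evallist for 10 consecutive rounds.
print(bst.best_score, bst.best_iteration, bst.best_ntree_limit)
```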
Prediction Prediction
---------- ----------
@ -123,9 +135,9 @@ dtest = xgb.DMatrix(data)
ypred = bst.predict(dtest) ypred = bst.predict(dtest)
``` ```
If early stopping is enabled during training, you can predict with the best iteration. If early stopping is enabled during training, you can get predictions from the best iteration with `bst.best_ntree_limit`:
```python ```python
ypred = bst.predict(xgmat,ntree_limit=bst.best_iteration) ypred = bst.predict(dtest, ntree_limit=bst.best_ntree_limit)
``` ```
Plotting Plotting

9
python-package/.pylintrc Normal file
View File

@ -0,0 +1,9 @@
[MASTER]
ignore=tests
unexpected-special-method-signature,too-many-nested-blocks
dummy-variables-rgx=(unused|)_.*
reports=no

View File

@ -1,7 +1,14 @@
include *.sh *.md include *.sh *.md *.rst
recursive-include xgboost * recursive-include xgboost *
recursive-include xgboost/wrapper * recursive-include xgboost/wrapper *
recursive-include xgboost/windows * recursive-include xgboost/windows *
recursive-include xgboost/subtree * recursive-include xgboost/subtree *
recursive-include xgboost/src * recursive-include xgboost/src *
recursive-include xgboost/multi-node * recursive-include xgboost/multi-node *
#exclude the pre-compiled .o files to avoid confusion
#including the pre-compiled .so is needed as a placeholder
#since it will be copied over after compiling on the fly
global-exclude xgboost/wrapper/*.so.gz
global-exclude xgboost/*.o
global-exclude *.pyo
global-exclude *.pyc

View File

@ -1,27 +0,0 @@
XGBoost Python Package
======================
Installation
------------
We are on [PyPI](https://pypi.python.org/pypi/xgboost) now. For stable version, please install using pip:
* ```pip install xgboost```
* Note for windows users: this pip installation may not work on some windows environment, and it may cause unexpected errors. pip installation on windows is currently disabled for further invesigation, please install from github.
For up-to-date version, please install from github.
* To make the python module, type ```./build.sh``` in the root directory of project
* Make sure you have [setuptools](https://pypi.python.org/pypi/setuptools)
* Install with `python setup.py install` from this directory.
* For windows users, please use the Visual Studio project file under [windows folder](../windows/). See also the [installation tutorial](https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/13043/run-xgboost-from-windows-and-python) from Kaggle Otto Forum.
Examples
------
* Refer also to the walk through example in [demo folder](../demo/guide-python)
* See also the [example scripts](../demo/kaggle-higgs) for Kaggle Higgs Challenge, including [speedtest script](../demo/kaggle-higgs/speedtest.py) on this dataset.
Note
-----
* If you want to build xgboost on Mac OS X with multiprocessing support where clang in XCode by default doesn't support, please install gcc 4.9 or higher using [homebrew](http://brew.sh/) ```brew tap homebrew/versions; brew install gcc49```
* If you want to run XGBoost process in parallel using the fork backend for joblib/multiprocessing, you must build XGBoost without support for OpenMP by `make no_omp=1`. Otherwise, use the forkserver (in Python 3.4) or spawn backend. See the [sklearn_parallel.py](../demo/guide-python/sklearn_parallel.py) demo.

56
python-package/README.rst Normal file
View File

@ -0,0 +1,56 @@
XGBoost Python Package
======================
|PyPI version| |PyPI downloads|
Installation
------------
We are on `PyPI <https://pypi.python.org/pypi/xgboost>`__ now. For
stable version, please install using pip:
- ``pip install xgboost``
- Note for windows users: this pip installation may not work on some
  windows environments, and it may cause unexpected errors. pip
  installation on windows is currently disabled for further
  investigation; please install from github.
For up-to-date version, please install from github.
- To make the python module, type ``./build.sh`` in the root directory
of project
- Make sure you have
`setuptools <https://pypi.python.org/pypi/setuptools>`__
- Install with ``cd python-package; python setup.py install`` from this directory.
- For windows users, please use the Visual Studio project file under
`windows folder <../windows/>`__. See also the `installation
tutorial <https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/13043/run-xgboost-from-windows-and-python>`__
from Kaggle Otto Forum.
Examples
--------
- Refer also to the walk through example in `demo
folder <../demo/guide-python>`__
- See also the `example scripts <../demo/kaggle-higgs>`__ for Kaggle
Higgs Challenge, including `speedtest
script <../demo/kaggle-higgs/speedtest.py>`__ on this dataset.
Note
----
- If you want to build xgboost on Mac OS X with multiprocessing support,
  which the clang shipped with XCode doesn't support by default, please
  install gcc 4.9 or higher using `homebrew <http://brew.sh/>`__:
  ``brew tap homebrew/versions; brew install gcc49``
- If you want to run XGBoost process in parallel using the fork backend
for joblib/multiprocessing, you must build XGBoost without support
for OpenMP by ``make no_omp=1``. Otherwise, use the forkserver (in
Python 3.4) or spawn backend. See the
`sklearn\_parallel.py <../demo/guide-python/sklearn_parallel.py>`__
demo.
.. |PyPI version| image:: https://badge.fury.io/py/xgboost.svg
:target: http://badge.fury.io/py/xgboost
.. |PyPI downloads| image:: https://img.shields.io/pypi/dm/xgboost.svg
:target: https://pypi.python.org/pypi/xgboost/

View File

@ -0,0 +1,52 @@
XGBoost Python Package Troubleshooting
======================
Windows platform
------------
The current best solution for installing xgboost on a windows machine is building from github. Please go to [windows](/windows/), build with the Visual Studio project file, and install. Additional detailed instructions can be found in this [installation tutorial](https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/13043/run-xgboost-from-windows-and-python) from the Kaggle Otto Forum.
`pip install xgboost` is **not** tested nor supported on the windows platform for now.
Linux platform (also Mac OS X in general)
------------
**Trouble 0**: I see error messages like this when installing from github using `python setup.py install`.
XGBoostLibraryNotFound: Cannot find XGBoost Libarary in the candicate path, did you install compilers and run build.sh in root path?
List of candidates:
/home/dmlc/anaconda/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/libxgboostwrapper.so
/home/dmlc/anaconda/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/../../wrapper/libxgboostwrapper.so
/home/dmlc/anaconda/lib/python2.7/site-packages/xgboost-0.4-py2.7.egg/xgboost/./wrapper/libxgboostwrapper.so
**Solution 0**: Please check if you have:
* installed the latest C++ compilers and `make`, for example `g++` and `gcc` (Linux) or `clang LLVM` (Mac OS X). Recommended compilers are `g++-5` or newer (Linux and Mac), or the `clang` that comes with Xcode on Mac OS X. For installing compilers, please refer to your system's package management commands, e.g. `apt-get`, `yum` or `brew` (Mac).
* compilers in your `$PATH`. Try typing `gcc` and see if you have it in your path.
* Do you use a shell other than `bash` and install from `pip`? In some old versions of the pip installation, the shell script used `pushd` for changing directory and triggering the build process, which may fail on shells without the `pushd` command. Please update to the latest version by removing the old installation and redoing `pip install xgboost`
* Some outdated versions of `make` may not recognize the recent changes in the `Makefile` and give this error; please update to the latest `make`:
`/usr/lib/ruby/gems/1.8/gems/make-0.3.1/bin/make:4: undefined local variable or method 'make' for main:Object (NameError)`
**Trouble 1**: I see the same error message as in **Trouble 0** when installing via `pip install xgboost`.
**Solution 1**: the problem is the same as in **Trouble 0**, please see **Solution 0**.
**Trouble 2**: I see this error message when running `pip install xgboost`. It says I have `libxgboostwrapper.so` but it is not valid.
OSError: /home/dmlc/anaconda/lib/python2.7/site-packages/xgboost/./wrapper/libxgboostwrapper.so: invalid ELF header
**Solution 2**: The solution is the same as in 0 and 1: install the latest `g++` compiler and the latest `make`. The reason for this rare error is that `pip` ships with a pre-compiled `libxgboostwrapper.so` built on Mac as a placeholder, so that `setup.py` can find the right lib path. If a system doesn't compile, it may refer to this placeholder lib and fail. This placeholder `libxgboostwrapper.so` will be automatically removed and correctly regenerated by the on-the-fly compilation for the system.
**Trouble 3**: My system's `pip` says it can't find a valid `xgboost` installation release on `PyPI`.
**Solution 3**: Some Linux systems come with an old `pip` version. Please update to the latest `pip` by following the official installation document at <http://pip.readthedocs.org/en/stable/installing/>
**Trouble 4**: I tried `python setup.py install` but it says the `setuptools` import fails.
**Solution 4**: Please make sure you have [setuptools](https://pypi.python.org/pypi/setuptools) before installing the python package.
Mac OS X (specific)
------------
Most of the troubles and solutions are the same as on the Linux platform. Mac has the following specific problems.
**Trouble 0**: I successfully installed `xgboost` from github or via `pip install xgboost`, but it runs very slowly with only a single thread. What is going on?
**Solution 0**: The `clang LLVM` compiler that comes with Xcode on Mac OS X doesn't support OpenMP multi-threading. An alternative is installing `homebrew` <http://brew.sh/> and running `brew install g++-5`, which provides multi-threaded OpenMP support.
**Trouble 1**: Can I install `clang-omp` to get OpenMP support without using `gcc`?
**Solution 1**: It is not supported and may have linking errors.

View File

@ -1,2 +1,2 @@
[metadata] [metadata]
description-file = README.md description-file = README.rst

View File

@ -2,21 +2,10 @@
"""Setup xgboost package.""" """Setup xgboost package."""
from __future__ import absolute_import from __future__ import absolute_import
import sys import sys
from setuptools import setup, find_packages
import subprocess
sys.path.insert(0, '.')
import os import os
#build on the fly if install in pip from setuptools import setup, find_packages
#otherwise, use build.sh in the parent directory #import subprocess
sys.path.insert(0, '.')
if 'pip' in __file__:
if not os.name == 'nt': #if not windows
build_sh = subprocess.Popen(['sh', 'xgboost/build-python.sh'])
build_sh.wait()
output = build_sh.communicate()
print(output)
CURRENT_DIR = os.path.dirname(__file__) CURRENT_DIR = os.path.dirname(__file__)
@ -28,16 +17,13 @@ libpath = {'__file__': libpath_py}
exec(compile(open(libpath_py, "rb").read(), libpath_py, 'exec'), libpath, libpath) exec(compile(open(libpath_py, "rb").read(), libpath_py, 'exec'), libpath, libpath)
LIB_PATH = libpath['find_lib_path']() LIB_PATH = libpath['find_lib_path']()
#print LIB_PATH
#to deploy to pip, please use #Please use setup_pip.py for generating and deploying pip installation
#make pythonpack #detailed instruction in setup_pip.py
#python setup.py register sdist upload
#and be sure to test it firstly using "python setup.py register sdist upload -r pypitest"
setup(name='xgboost', setup(name='xgboost',
version=open(os.path.join(CURRENT_DIR, 'xgboost/VERSION')).read().strip(), version=open(os.path.join(CURRENT_DIR, 'xgboost/VERSION')).read().strip(),
#version='0.4a13', #version='0.4a23',
description=open(os.path.join(CURRENT_DIR, 'README.md')).read(), description=open(os.path.join(CURRENT_DIR, 'README.rst')).read(),
install_requires=[ install_requires=[
'numpy', 'numpy',
'scipy', 'scipy',
@ -46,10 +32,6 @@ setup(name='xgboost',
maintainer_email='phunter.lau@gmail.com', maintainer_email='phunter.lau@gmail.com',
zip_safe=False, zip_safe=False,
packages=find_packages(), packages=find_packages(),
#don't need this and don't use this, give everything to MANIFEST.in
#package_dir = {'':'xgboost'},
#package_data = {'': ['*.txt','*.md','*.sh'],
# }
#this will use MANIFEST.in during install where we specify additional files, #this will use MANIFEST.in during install where we specify additional files,
#this is the golden line #this is the golden line
include_package_data=True, include_package_data=True,

View File

@ -0,0 +1,58 @@
# pylint: disable=invalid-name, exec-used
"""Setup xgboost package."""
from __future__ import absolute_import
import sys
import os
from setuptools import setup, find_packages
#import subprocess
sys.path.insert(0, '.')
#this script is for packing and shipping pip installation
#it builds xgboost code on the fly and packs for pip
#please don't use this file for installing from github
if os.name != 'nt': #if not windows, compile and install
os.system('sh ./xgboost/build-python.sh')
else:
print('Windows users please use github installation.')
sys.exit()
CURRENT_DIR = os.path.dirname(__file__)
# We can not import `xgboost.libpath` in setup.py directly since xgboost/__init__.py
# imports `xgboost.core`, which in turn imports `numpy` and `scipy`, which are listed in
# setup's `install_requires`. That's why we're using `exec` here.
libpath_py = os.path.join(CURRENT_DIR, 'xgboost/libpath.py')
libpath = {'__file__': libpath_py}
exec(compile(open(libpath_py, "rb").read(), libpath_py, 'exec'), libpath, libpath)
LIB_PATH = libpath['find_lib_path']()
#to deploy to pip, please use
#make pythonpack
#python setup.py register sdist upload
#and be sure to test it firstly using "python setup.py register sdist upload -r pypitest"
setup(name='xgboost',
#version=open(os.path.join(CURRENT_DIR, 'xgboost/VERSION')).read().strip(),
version='0.4a30',
description=open(os.path.join(CURRENT_DIR, 'README.rst')).read(),
install_requires=[
'numpy',
'scipy',
],
maintainer='Hongliang Liu',
maintainer_email='phunter.lau@gmail.com',
zip_safe=False,
packages=find_packages(),
#don't need this and don't use this, give everything to MANIFEST.in
#package_dir = {'':'xgboost'},
#package_data = {'': ['*.txt','*.md','*.sh'],
# }
#this will use MANIFEST.in during install where we specify additional files,
#this is the golden line
include_package_data=True,
#!!! don't use data_files for creating pip installation,
#otherwise install_data process will copy it to
#root directory for some machines, and cause confusions on building
#data_files=[('xgboost', LIB_PATH)],
url='https://github.com/dmlc/xgboost')

View File

@ -10,8 +10,11 @@ import os
from .core import DMatrix, Booster from .core import DMatrix, Booster
from .training import train, cv from .training import train, cv
from .sklearn import XGBModel, XGBClassifier, XGBRegressor try:
from .plotting import plot_importance, plot_tree, to_graphviz from .sklearn import XGBModel, XGBClassifier, XGBRegressor
from .plotting import plot_importance, plot_tree, to_graphviz
except ImportError:
print('Error when loading sklearn/plotting. Please install scikit-learn')
VERSION_FILE = os.path.join(os.path.dirname(__file__), 'VERSION') VERSION_FILE = os.path.join(os.path.dirname(__file__), 'VERSION')
__version__ = open(VERSION_FILE).read().strip() __version__ = open(VERSION_FILE).read().strip()

View File

@ -10,7 +10,11 @@
# conflict with build.sh which is for everything. # conflict with build.sh which is for everything.
pushd xgboost #pushd xgboost
oldpath=`pwd`
cd ./xgboost/
#remove the pre-compiled .so and trigger the system's on-the-fly compiling
make clean
if make python; then if make python; then
echo "Successfully build multi-thread xgboost" echo "Successfully build multi-thread xgboost"
else else
@ -23,4 +27,4 @@ else
echo "If you want multi-threaded version" echo "If you want multi-threaded version"
echo "See additional instructions in doc/build.md" echo "See additional instructions in doc/build.md"
fi fi
popd cd $oldpath

View File

@ -0,0 +1,47 @@
# coding: utf-8
# pylint: disable=unused-import, invalid-name, wrong-import-position
"""For compatibility"""
from __future__ import absolute_import
import sys
PY3 = (sys.version_info[0] == 3)
if PY3:
# pylint: disable=invalid-name, redefined-builtin
STRING_TYPES = str,
else:
# pylint: disable=invalid-name
STRING_TYPES = basestring,
# pandas
try:
from pandas import DataFrame
PANDAS_INSTALLED = True
except ImportError:
class DataFrame(object):
""" dummy for pandas.DataFrame """
pass
PANDAS_INSTALLED = False
# sklearn
try:
from sklearn.base import BaseEstimator
from sklearn.base import RegressorMixin, ClassifierMixin
from sklearn.preprocessing import LabelEncoder
SKLEARN_INSTALLED = True
XGBModelBase = BaseEstimator
XGBRegressorBase = RegressorMixin
XGBClassifierBase = ClassifierMixin
except ImportError:
SKLEARN_INSTALLED = False
# used for compatibility without sklearn
XGBModelBase = object
XGBClassifierBase = object
XGBRegressorBase = object

View File

@ -4,7 +4,6 @@
from __future__ import absolute_import from __future__ import absolute_import
import os import os
import sys
import ctypes import ctypes
import collections import collections
@ -13,20 +12,12 @@ import scipy.sparse
from .libpath import find_lib_path from .libpath import find_lib_path
from .compat import STRING_TYPES, PY3, DataFrame
class XGBoostError(Exception): class XGBoostError(Exception):
"""Error throwed by xgboost trainer.""" """Error throwed by xgboost trainer."""
pass pass
PY3 = (sys.version_info[0] == 3)
if PY3:
# pylint: disable=invalid-name, redefined-builtin
STRING_TYPES = str,
else:
# pylint: disable=invalid-name
STRING_TYPES = basestring,
def from_pystr_to_cstr(data): def from_pystr_to_cstr(data):
"""Convert a list of Python str to C pointer """Convert a list of Python str to C pointer
@ -138,28 +129,50 @@ def c_array(ctype, values):
return (ctype * len(values))(*values) return (ctype * len(values))(*values)
def _maybe_from_pandas(data, feature_names, feature_types):
""" Extract internal data from pd.DataFrame """ PANDAS_DTYPE_MAPPER = {'int8': 'int', 'int16': 'int', 'int32': 'int', 'int64': 'int',
try: 'uint8': 'int', 'uint16': 'int', 'uint32': 'int', 'uint64': 'int',
import pandas as pd 'float16': 'float', 'float32': 'float', 'float64': 'float',
except ImportError: 'bool': 'i'}
def _maybe_pandas_data(data, feature_names, feature_types):
""" Extract internal data from pd.DataFrame for DMatrix data """
if not isinstance(data, DataFrame):
return data, feature_names, feature_types return data, feature_names, feature_types
if not isinstance(data, pd.DataFrame): data_dtypes = data.dtypes
return data, feature_names, feature_types if not all(dtype.name in PANDAS_DTYPE_MAPPER for dtype in data_dtypes):
raise ValueError('DataFrame.dtypes for data must be int, float or bool')
dtypes = data.dtypes
if not all(dtype.name in ('int64', 'float64', 'bool') for dtype in dtypes):
raise ValueError('DataFrame.dtypes must be int, float or bool')
if feature_names is None: if feature_names is None:
feature_names = data.columns.format() feature_names = data.columns.format()
if feature_types is None: if feature_types is None:
mapper = {'int64': 'int', 'float64': 'q', 'bool': 'i'} feature_types = [PANDAS_DTYPE_MAPPER[dtype.name] for dtype in data_dtypes]
feature_types = [mapper[dtype.name] for dtype in dtypes]
data = data.values.astype('float') data = data.values.astype('float')
return data, feature_names, feature_types return data, feature_names, feature_types
def _maybe_pandas_label(label):
""" Extract internal data from pd.DataFrame for DMatrix label """
if isinstance(label, DataFrame):
if len(label.columns) > 1:
raise ValueError('DataFrame for label cannot have multiple columns')
label_dtypes = label.dtypes
if not all(dtype.name in PANDAS_DTYPE_MAPPER for dtype in label_dtypes):
raise ValueError('DataFrame.dtypes for label must be int, float or bool')
else:
label = label.values.astype('float')
# pd.Series can be passed to xgb as it is
return label
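A hedged usage sketch of the widened pandas support above: any dtype listed in PANDAS_DTYPE_MAPPER is accepted for data, and a single-column DataFrame (or a plain Series) may be used as the label. Column names and values here are made up:

import pandas as pd
import xgboost as xgb

df = pd.DataFrame({'f0': pd.Series([1, 2, 3], dtype='int8'),
                   'f1': pd.Series([0.1, 0.2, 0.3], dtype='float32'),
                   'f2': [True, False, True]})
label = pd.DataFrame({'y': [0, 1, 1]})       # single-column DataFrame labels are allowed

dtrain = xgb.DMatrix(df, label=label)
# feature names and types are inferred from the DataFrame when not given explicitly
print(dtrain.feature_names)                  # expected: ['f0', 'f1', 'f2']
print(dtrain.feature_types)                  # expected: ['int', 'float', 'i']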
class DMatrix(object): class DMatrix(object):
"""Data Matrix used in XGBoost. """Data Matrix used in XGBoost.
@ -192,20 +205,19 @@ class DMatrix(object):
silent : boolean, optional silent : boolean, optional
Whether print messages during construction Whether print messages during construction
feature_names : list, optional feature_names : list, optional
Labels for features. Set names for features.
feature_types : list, optional feature_types : list, optional
Labels for features. Set types for features.
""" """
# force into void_p, mac need to pass things in as void_p # force into void_p, mac need to pass things in as void_p
if data is None: if data is None:
self.handle = None self.handle = None
return return
klass = getattr(getattr(data, '__class__', None), '__name__', None) data, feature_names, feature_types = _maybe_pandas_data(data,
if klass == 'DataFrame': feature_names,
# once check class name to avoid unnecessary pandas import feature_types)
data, feature_names, feature_types = _maybe_from_pandas(data, feature_names, label = _maybe_pandas_label(label)
feature_types)
if isinstance(data, STRING_TYPES): if isinstance(data, STRING_TYPES):
self.handle = ctypes.c_void_p() self.handle = ctypes.c_void_p()
@ -223,7 +235,7 @@ class DMatrix(object):
csr = scipy.sparse.csr_matrix(data) csr = scipy.sparse.csr_matrix(data)
self._init_from_csr(csr) self._init_from_csr(csr)
except: except:
raise TypeError('can not intialize DMatrix from {}'.format(type(data).__name__)) raise TypeError('can not initialize DMatrix from {}'.format(type(data).__name__))
if label is not None: if label is not None:
self.set_label(label) self.set_label(label)
if weight is not None: if weight is not None:
@ -511,7 +523,7 @@ class DMatrix(object):
feature_names : list or None feature_names : list or None
Labels for features. None will reset existing feature names Labels for features. None will reset existing feature names
""" """
if not feature_names is None: if feature_names is not None:
# validate feature name # validate feature name
if not isinstance(feature_names, list): if not isinstance(feature_names, list):
feature_names = list(feature_names) feature_names = list(feature_names)
@ -520,10 +532,11 @@ class DMatrix(object):
if len(feature_names) != self.num_col(): if len(feature_names) != self.num_col():
msg = 'feature_names must have the same length as data' msg = 'feature_names must have the same length as data'
raise ValueError(msg) raise ValueError(msg)
# prohibit use of symbols that may affect parsing. e.g. ``[]=.`` # prohibit use of symbols that may affect parsing. e.g. []<
if not all(isinstance(f, STRING_TYPES) and f.isalnum() if not all(isinstance(f, STRING_TYPES) and
not any(x in f for x in set(('[', ']', '<')))
for f in feature_names): for f in feature_names):
raise ValueError('all feature_names must be alphanumerics') raise ValueError('feature_names may not contain [, ] or <')
else: else:
# reset feature_types also # reset feature_types also
self.feature_types = None self.feature_types = None
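A short sketch of the relaxed feature-name validation above: names only need to be strings free of '[', ']' and '<', rather than strictly alphanumeric. The data and names are made up, and feature_names is assumed to be exposed as a settable property, as in the setter shown above:

import numpy as np
import xgboost as xgb

dtrain = xgb.DMatrix(np.random.rand(5, 3), label=np.zeros(5))

dtrain.feature_names = ['age', 'income_usd', 'has account']   # underscores and spaces now pass
try:
    dtrain.feature_names = ['f[0]', 'f1', 'f2']                # '[' and ']' are rejected
except ValueError as err:
    print(err)                                                 # feature_names may not contain [, ] or <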
@ -541,7 +554,7 @@ class DMatrix(object):
feature_types : list or None feature_types : list or None
Labels for features. None will reset existing feature names Labels for features. None will reset existing feature names
""" """
if not feature_types is None: if feature_types is not None:
if self.feature_names is None: if self.feature_names is None:
msg = 'Unable to set feature types before setting names' msg = 'Unable to set feature types before setting names'
@ -556,12 +569,11 @@ class DMatrix(object):
if len(feature_types) != self.num_col(): if len(feature_types) != self.num_col():
msg = 'feature_types must have the same length as data' msg = 'feature_types must have the same length as data'
raise ValueError(msg) raise ValueError(msg)
# prohibit to use symbols may affect to parse. e.g. ``[]=.``
valid = ('q', 'i', 'int', 'float') valid = ('int', 'float', 'i', 'q')
if not all(isinstance(f, STRING_TYPES) and f in valid if not all(isinstance(f, STRING_TYPES) and f in valid
for f in feature_types): for f in feature_types):
raise ValueError('all feature_names must be {i, q, int, float}') raise ValueError('All feature_types must be {int, float, i, q}')
self._feature_types = feature_types self._feature_types = feature_types
@ -745,8 +757,13 @@ class Booster(object):
else: else:
res = '[%d]' % iteration res = '[%d]' % iteration
for dmat, evname in evals: for dmat, evname in evals:
name, val = feval(self.predict(dmat), dmat) feval_ret = feval(self.predict(dmat), dmat)
res += '\t%s-%s:%f' % (evname, name, val) if isinstance(feval_ret, list):
for name, val in feval_ret:
res += '\t%s-%s:%f' % (evname, name, val)
else:
name, val = feval_ret
res += '\t%s-%s:%f' % (evname, name, val)
return res return res
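eval_set() above now accepts a feval that returns either a single (name, value) pair or a list of such pairs. A hypothetical custom metric illustrating the list form (the metric itself is made up):

import numpy as np

def multi_metric(preds, dtrain):
    labels = dtrain.get_label()
    err = float(np.mean((preds > 0.5) != labels))
    mae = float(np.mean(np.abs(preds - labels)))
    # returning a list of (name, value) tuples is now handled by eval_set()
    return [('error', err), ('mae', mae)]

# bst = xgb.train(params, dtrain, num_boost_round=10,
#                 evals=[(dtest, 'eval')], feval=multi_metric)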
def eval(self, data, name='eval', iteration=0): def eval(self, data, name='eval', iteration=0):
@ -873,6 +890,7 @@ class Booster(object):
_check_call(_LIB.XGBoosterLoadModelFromBuffer(self.handle, ptr, length)) _check_call(_LIB.XGBoosterLoadModelFromBuffer(self.handle, ptr, length))
def dump_model(self, fout, fmap='', with_stats=False): def dump_model(self, fout, fmap='', with_stats=False):
# pylint: disable=consider-using-enumerate
""" """
Dump model into a text file. Dump model into a text file.

View File

@ -36,9 +36,10 @@ def find_lib_path():
else: else:
dll_path = [os.path.join(p, 'libxgboostwrapper.so') for p in dll_path] dll_path = [os.path.join(p, 'libxgboostwrapper.so') for p in dll_path]
lib_path = [p for p in dll_path if os.path.exists(p) and os.path.isfile(p)] lib_path = [p for p in dll_path if os.path.exists(p) and os.path.isfile(p)]
# From GitHub issues, most installation errors come from machines w/o compilers
if len(lib_path) == 0 and not os.environ.get('XGBOOST_BUILD_DOC', False): if len(lib_path) == 0 and not os.environ.get('XGBOOST_BUILD_DOC', False):
raise XGBoostLibraryNotFound( raise XGBoostLibraryNotFound(
'Cannot find XGBoost Library in the candidate path, ' + 'Cannot find XGBoost Library in the candidate path, ' +
'did you run build.sh in root path?\n' 'did you install compilers and run build.sh in root path?\n'
'List of candidates:\n' + ('\n'.join(dll_path))) 'List of candidates:\n' + ('\n'.join(dll_path)))
return lib_path return lib_path

View File

@ -5,13 +5,13 @@
from __future__ import absolute_import from __future__ import absolute_import
import re import re
from io import BytesIO
import numpy as np import numpy as np
from .core import Booster from .core import Booster
from .sklearn import XGBModel
from io import BytesIO
def plot_importance(booster, ax=None, height=0.2, def plot_importance(booster, ax=None, height=0.2,
xlim=None, title='Feature importance', xlim=None, ylim=None, title='Feature importance',
xlabel='F score', ylabel='Features', xlabel='F score', ylabel='Features',
grid=True, **kwargs): grid=True, **kwargs):
@ -19,14 +19,16 @@ def plot_importance(booster, ax=None, height=0.2,
Parameters Parameters
---------- ----------
booster : Booster or dict booster : Booster, XGBModel or dict
Booster instance, or dict taken by Booster.get_fscore() Booster or XGBModel instance, or dict taken by Booster.get_fscore()
ax : matplotlib Axes, default None ax : matplotlib Axes, default None
Target axes instance. If None, new figure and axes will be created. Target axes instance. If None, new figure and axes will be created.
height : float, default 0.2 height : float, default 0.2
Bar height, passed to ax.barh() Bar height, passed to ax.barh()
xlim : tuple, default None xlim : tuple, default None
Tuple passed to axes.xlim() Tuple passed to axes.xlim()
ylim : tuple, default None
Tuple passed to axes.ylim()
title : str, default "Feature importance" title : str, default "Feature importance"
Axes title. To disable, pass None. Axes title. To disable, pass None.
xlabel : str, default "F score" xlabel : str, default "F score"
@ -46,12 +48,14 @@ def plot_importance(booster, ax=None, height=0.2,
except ImportError: except ImportError:
raise ImportError('You must install matplotlib to plot importance') raise ImportError('You must install matplotlib to plot importance')
if isinstance(booster, Booster): if isinstance(booster, XGBModel):
importance = booster.booster().get_fscore()
elif isinstance(booster, Booster):
importance = booster.get_fscore() importance = booster.get_fscore()
elif isinstance(booster, dict): elif isinstance(booster, dict):
importance = booster importance = booster
else: else:
raise ValueError('tree must be Booster or dict instance') raise ValueError('tree must be Booster, XGBModel or dict instance')
if len(importance) == 0: if len(importance) == 0:
raise ValueError('Booster.get_fscore() results in empty') raise ValueError('Booster.get_fscore() results in empty')
@ -73,12 +77,19 @@ def plot_importance(booster, ax=None, height=0.2,
ax.set_yticklabels(labels) ax.set_yticklabels(labels)
if xlim is not None: if xlim is not None:
if not isinstance(xlim, tuple) or len(xlim, 2): if not isinstance(xlim, tuple) or len(xlim) != 2:
raise ValueError('xlim must be a tuple of 2 elements') raise ValueError('xlim must be a tuple of 2 elements')
else: else:
xlim = (0, max(values) * 1.1) xlim = (0, max(values) * 1.1)
ax.set_xlim(xlim) ax.set_xlim(xlim)
if ylim is not None:
if not isinstance(ylim, tuple) or len(ylim) != 2:
raise ValueError('ylim must be a tuple of 2 elements')
else:
ylim = (-1, len(importance))
ax.set_ylim(ylim)
if title is not None: if title is not None:
ax.set_title(title) ax.set_title(title)
if xlabel is not None: if xlabel is not None:
@ -142,8 +153,8 @@ def to_graphviz(booster, num_trees=0, rankdir='UT',
Parameters Parameters
---------- ----------
booster : Booster booster : Booster, XGBModel
Booster instance Booster or XGBModel instance
num_trees : int, default 0 num_trees : int, default 0
Specify the ordinal number of target tree Specify the ordinal number of target tree
rankdir : str, default "UT" rankdir : str, default "UT"
@ -165,8 +176,11 @@ def to_graphviz(booster, num_trees=0, rankdir='UT',
except ImportError: except ImportError:
raise ImportError('You must install graphviz to plot tree') raise ImportError('You must install graphviz to plot tree')
if not isinstance(booster, Booster): if not isinstance(booster, (Booster, XGBModel)):
raise ValueError('booster must be Booster instance') raise ValueError('booster must be Booster or XGBModel instance')
if isinstance(booster, XGBModel):
booster = booster.booster()
tree = booster.get_dump()[num_trees] tree = booster.get_dump()[num_trees]
tree = tree.split() tree = tree.split()
@ -193,8 +207,8 @@ def plot_tree(booster, num_trees=0, rankdir='UT', ax=None, **kwargs):
Parameters Parameters
---------- ----------
booster : Booster booster : Booster, XGBModel
Booster instance Booster or XGBModel instance
num_trees : int, default 0 num_trees : int, default 0
Specify the ordinal number of target tree Specify the ordinal number of target tree
rankdir : str, default "UT" rankdir : str, default "UT"
@ -216,7 +230,6 @@ def plot_tree(booster, num_trees=0, rankdir='UT', ax=None, **kwargs):
except ImportError: except ImportError:
raise ImportError('You must install matplotlib to plot tree') raise ImportError('You must install matplotlib to plot tree')
if ax is None: if ax is None:
_, ax = plt.subplots(1, 1) _, ax = plt.subplots(1, 1)
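A hedged usage sketch of the plotting changes above: the sklearn wrapper can now be passed directly, and ylim is forwarded to the axes. It assumes matplotlib and graphviz are installed and that the helpers are exposed at package level as xgb.plot_importance / xgb.plot_tree; the data is synthetic:

import numpy as np
import xgboost as xgb

X = np.random.rand(100, 5)
y = np.random.randint(2, size=100)
clf = xgb.XGBClassifier(n_estimators=10).fit(X, y)

ax = xgb.plot_importance(clf, ylim=(0, 5))   # ylim is passed to axes.set_ylim
xgb.plot_tree(clf, num_trees=0)              # wrapper objects are unwrapped via booster()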

View File

@ -7,23 +7,9 @@ import numpy as np
from .core import Booster, DMatrix, XGBoostError from .core import Booster, DMatrix, XGBoostError
from .training import train from .training import train
try: from .compat import (SKLEARN_INSTALLED, XGBModelBase,
from sklearn.base import BaseEstimator XGBClassifierBase, XGBRegressorBase, LabelEncoder)
from sklearn.base import RegressorMixin, ClassifierMixin
from sklearn.preprocessing import LabelEncoder
SKLEARN_INSTALLED = True
except ImportError:
SKLEARN_INSTALLED = False
# used for compatiblity without sklearn
XGBModelBase = object
XGBClassifierBase = object
XGBRegressorBase = object
if SKLEARN_INSTALLED:
XGBModelBase = BaseEstimator
XGBRegressorBase = RegressorMixin
XGBClassifierBase = ClassifierMixin
class XGBModel(XGBModelBase): class XGBModel(XGBModelBase):
# pylint: disable=too-many-arguments, too-many-instance-attributes, invalid-name # pylint: disable=too-many-arguments, too-many-instance-attributes, invalid-name
@ -54,6 +40,14 @@ class XGBModel(XGBModelBase):
Subsample ratio of the training instance. Subsample ratio of the training instance.
colsample_bytree : float colsample_bytree : float
Subsample ratio of columns when constructing each tree. Subsample ratio of columns when constructing each tree.
colsample_bylevel : float
Subsample ratio of columns for each split, in each level.
reg_alpha : float (xgb's alpha)
L1 regularization term on weights
reg_lambda : float (xgb's lambda)
L2 regularization term on weights
scale_pos_weight : float
Balancing of positive and negative weights.
base_score: base_score:
The initial prediction score of all instances, global bias. The initial prediction score of all instances, global bias.
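An illustrative construction using the newly exposed wrapper parameters (the values are arbitrary and only meant to show the keyword names documented above):

import xgboost as xgb

clf = xgb.XGBClassifier(n_estimators=200,
                        colsample_bylevel=0.8,   # per-level column subsampling
                        reg_alpha=0.1,           # L1 term (xgb's alpha)
                        reg_lambda=1.0,          # L2 term (xgb's lambda)
                        scale_pos_weight=5.0)    # useful for imbalanced classes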
@ -66,7 +60,8 @@ class XGBModel(XGBModelBase):
def __init__(self, max_depth=3, learning_rate=0.1, n_estimators=100, def __init__(self, max_depth=3, learning_rate=0.1, n_estimators=100,
silent=True, objective="reg:linear", silent=True, objective="reg:linear",
nthread=-1, gamma=0, min_child_weight=1, max_delta_step=0, nthread=-1, gamma=0, min_child_weight=1, max_delta_step=0,
subsample=1, colsample_bytree=1, subsample=1, colsample_bytree=1, colsample_bylevel=1,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
base_score=0.5, seed=0, missing=None): base_score=0.5, seed=0, missing=None):
if not SKLEARN_INSTALLED: if not SKLEARN_INSTALLED:
raise XGBoostError('sklearn needs to be installed in order to use this module') raise XGBoostError('sklearn needs to be installed in order to use this module')
@ -82,6 +77,10 @@ class XGBModel(XGBModelBase):
self.max_delta_step = max_delta_step self.max_delta_step = max_delta_step
self.subsample = subsample self.subsample = subsample
self.colsample_bytree = colsample_bytree self.colsample_bytree = colsample_bytree
self.colsample_bylevel = colsample_bylevel
self.reg_alpha = reg_alpha
self.reg_lambda = reg_lambda
self.scale_pos_weight = scale_pos_weight
self.base_score = base_score self.base_score = base_score
self.seed = seed self.seed = seed
@ -131,7 +130,7 @@ class XGBModel(XGBModelBase):
def fit(self, X, y, eval_set=None, eval_metric=None, def fit(self, X, y, eval_set=None, eval_metric=None,
early_stopping_rounds=None, verbose=True): early_stopping_rounds=None, verbose=True):
# pylint: disable=missing-docstring,invalid-name,attribute-defined-outside-init # pylint: disable=missing-docstring,invalid-name,attribute-defined-outside-init, redefined-variable-type
""" """
Fit the gradient boosting model Fit the gradient boosting model
@ -165,7 +164,7 @@ class XGBModel(XGBModelBase):
""" """
trainDmatrix = DMatrix(X, label=y, missing=self.missing) trainDmatrix = DMatrix(X, label=y, missing=self.missing)
eval_results = {} evals_result = {}
if eval_set is not None: if eval_set is not None:
evals = list(DMatrix(x[0], label=x[1]) for x in eval_set) evals = list(DMatrix(x[0], label=x[1]) for x in eval_set)
evals = list(zip(evals, ["validation_{}".format(i) for i in evals = list(zip(evals, ["validation_{}".format(i) for i in
@ -185,23 +184,62 @@ class XGBModel(XGBModelBase):
self._Booster = train(params, trainDmatrix, self._Booster = train(params, trainDmatrix,
self.n_estimators, evals=evals, self.n_estimators, evals=evals,
early_stopping_rounds=early_stopping_rounds, early_stopping_rounds=early_stopping_rounds,
evals_result=eval_results, feval=feval, evals_result=evals_result, feval=feval,
verbose_eval=verbose) verbose_eval=verbose)
if eval_results:
eval_results = {k: np.array(v, dtype=float) if evals_result:
for k, v in eval_results.items()} for val in evals_result.items():
eval_results = {k: np.array(v) for k, v in eval_results.items()} evals_result_key = list(val[1].keys())[0]
self.eval_results = eval_results evals_result[val[0]][evals_result_key] = val[1][evals_result_key]
self.evals_result_ = evals_result
if early_stopping_rounds is not None: if early_stopping_rounds is not None:
self.best_score = self._Booster.best_score self.best_score = self._Booster.best_score
self.best_iteration = self._Booster.best_iteration self.best_iteration = self._Booster.best_iteration
return self return self
def predict(self, data): def predict(self, data, output_margin=False, ntree_limit=0):
# pylint: disable=missing-docstring,invalid-name # pylint: disable=missing-docstring,invalid-name
test_dmatrix = DMatrix(data, missing=self.missing) test_dmatrix = DMatrix(data, missing=self.missing)
return self.booster().predict(test_dmatrix) return self.booster().predict(test_dmatrix,
output_margin=output_margin,
ntree_limit=ntree_limit)
def evals_result(self):
"""Return the evaluation results.
If eval_set is passed to the `fit` function, you can call evals_result() to
get evaluation results for all passed eval_sets. When eval_metric is also
passed to the `fit` function, the evals_result will contain the eval_metrics
passed to the `fit` function.
Returns
-------
evals_result : dictionary
Example
-------
param_dist = {'objective':'binary:logistic', 'n_estimators':2}
clf = xgb.XGBModel(**param_dist)
clf.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_test, y_test)],
eval_metric='logloss',
verbose=True)
evals_result = clf.evals_result()
The variable evals_result will contain:
{'validation_0': {'logloss': ['0.604835', '0.531479']},
'validation_1': {'logloss': ['0.41965', '0.17686']}}
"""
if self.evals_result_:
evals_result = self.evals_result_
else:
raise XGBoostError('No results.')
return evals_result
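A hedged end-to-end sketch tying together the new wrapper behaviour: evals_result() after fitting with an eval_set, the best_* attributes set by early stopping, and the new ntree_limit argument to predict. The data is synthetic, and the ntree_limit value assumes the plain case of one tree per boosting round:

import numpy as np
import xgboost as xgb

X = np.random.rand(200, 5)
y = np.random.randint(2, size=200)
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

clf = xgb.XGBModel(objective='binary:logistic', n_estimators=50)
clf.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        eval_metric='logloss',
        early_stopping_rounds=5,
        verbose=False)

print(clf.evals_result()['validation_1']['logloss'][-1])        # last recorded logloss
print(clf.best_score, clf.best_iteration)                        # set when early stopping is used
preds = clf.predict(X_test, ntree_limit=clf.best_iteration + 1)  # new ntree_limit argument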
class XGBClassifier(XGBModel, XGBClassifierBase): class XGBClassifier(XGBModel, XGBClassifierBase):
@ -214,18 +252,20 @@ class XGBClassifier(XGBModel, XGBClassifierBase):
n_estimators=100, silent=True, n_estimators=100, silent=True,
objective="binary:logistic", objective="binary:logistic",
nthread=-1, gamma=0, min_child_weight=1, nthread=-1, gamma=0, min_child_weight=1,
max_delta_step=0, subsample=1, colsample_bytree=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
base_score=0.5, seed=0, missing=None): base_score=0.5, seed=0, missing=None):
super(XGBClassifier, self).__init__(max_depth, learning_rate, super(XGBClassifier, self).__init__(max_depth, learning_rate,
n_estimators, silent, objective, n_estimators, silent, objective,
nthread, gamma, min_child_weight, nthread, gamma, min_child_weight,
max_delta_step, subsample, max_delta_step, subsample,
colsample_bytree, colsample_bytree, colsample_bylevel,
base_score, seed, missing) reg_alpha, reg_lambda,
scale_pos_weight, base_score, seed, missing)
def fit(self, X, y, sample_weight=None, eval_set=None, eval_metric=None, def fit(self, X, y, sample_weight=None, eval_set=None, eval_metric=None,
early_stopping_rounds=None, verbose=True): early_stopping_rounds=None, verbose=True):
# pylint: disable = attribute-defined-outside-init,arguments-differ # pylint: disable = attribute-defined-outside-init,arguments-differ, redefined-variable-type
""" """
Fit gradient boosting classifier Fit gradient boosting classifier
@ -259,7 +299,7 @@ class XGBClassifier(XGBModel, XGBClassifierBase):
If `verbose` and an evaluation set is used, writes the evaluation If `verbose` and an evaluation set is used, writes the evaluation
metric measured on the validation set to stderr. metric measured on the validation set to stderr.
""" """
eval_results = {} evals_result = {}
self.classes_ = list(np.unique(y)) self.classes_ = list(np.unique(y))
self.n_classes_ = len(self.classes_) self.n_classes_ = len(self.classes_)
if self.n_classes_ > 2: if self.n_classes_ > 2:
@ -299,13 +339,14 @@ class XGBClassifier(XGBModel, XGBClassifierBase):
self._Booster = train(xgb_options, train_dmatrix, self.n_estimators, self._Booster = train(xgb_options, train_dmatrix, self.n_estimators,
evals=evals, evals=evals,
early_stopping_rounds=early_stopping_rounds, early_stopping_rounds=early_stopping_rounds,
evals_result=eval_results, feval=feval, evals_result=evals_result, feval=feval,
verbose_eval=verbose) verbose_eval=verbose)
if eval_results: if evals_result:
eval_results = {k: np.array(v, dtype=float) for val in evals_result.items():
for k, v in eval_results.items()} evals_result_key = list(val[1].keys())[0]
self.eval_results = eval_results evals_result[val[0]][evals_result_key] = val[1][evals_result_key]
self.evals_result_ = evals_result
if early_stopping_rounds is not None: if early_stopping_rounds is not None:
self.best_score = self._Booster.best_score self.best_score = self._Booster.best_score
@ -313,9 +354,11 @@ class XGBClassifier(XGBModel, XGBClassifierBase):
return self return self
def predict(self, data): def predict(self, data, output_margin=False, ntree_limit=0):
test_dmatrix = DMatrix(data, missing=self.missing) test_dmatrix = DMatrix(data, missing=self.missing)
class_probs = self.booster().predict(test_dmatrix) class_probs = self.booster().predict(test_dmatrix,
output_margin=output_margin,
ntree_limit=ntree_limit)
if len(class_probs.shape) > 1: if len(class_probs.shape) > 1:
column_indexes = np.argmax(class_probs, axis=1) column_indexes = np.argmax(class_probs, axis=1)
else: else:
@ -323,9 +366,11 @@ class XGBClassifier(XGBModel, XGBClassifierBase):
column_indexes[class_probs > 0.5] = 1 column_indexes[class_probs > 0.5] = 1
return self._le.inverse_transform(column_indexes) return self._le.inverse_transform(column_indexes)
def predict_proba(self, data): def predict_proba(self, data, output_margin=False, ntree_limit=0):
test_dmatrix = DMatrix(data, missing=self.missing) test_dmatrix = DMatrix(data, missing=self.missing)
class_probs = self.booster().predict(test_dmatrix) class_probs = self.booster().predict(test_dmatrix,
output_margin=output_margin,
ntree_limit=ntree_limit)
if self.objective == "multi:softprob": if self.objective == "multi:softprob":
return class_probs return class_probs
else: else:
@ -333,6 +378,42 @@ class XGBClassifier(XGBModel, XGBClassifierBase):
classzero_probs = 1.0 - classone_probs classzero_probs = 1.0 - classone_probs
return np.vstack((classzero_probs, classone_probs)).transpose() return np.vstack((classzero_probs, classone_probs)).transpose()
def evals_result(self):
"""Return the evaluation results.
If eval_set is passed to the `fit` function, you can call evals_result() to
get evaluation results for all passed eval_sets. When eval_metric is also
passed to the `fit` function, the evals_result will contain the eval_metrics
passed to the `fit` function.
Returns
-------
evals_result : dictionary
Example
-------
param_dist = {'objective':'binary:logistic', 'n_estimators':2}
clf = xgb.XGBClassifier(**param_dist)
clf.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_test, y_test)],
eval_metric='logloss',
verbose=True)
evals_result = clf.evals_result()
The variable evals_result will contain:
{'validation_0': {'logloss': ['0.604835', '0.531479']},
'validation_1': {'logloss': ['0.41965', '0.17686']}}
"""
if self.evals_result_:
evals_result = self.evals_result_
else:
raise XGBoostError('No results.')
return evals_result
class XGBRegressor(XGBModel, XGBRegressorBase): class XGBRegressor(XGBModel, XGBRegressorBase):
# pylint: disable=missing-docstring # pylint: disable=missing-docstring
__doc__ = """Implementation of the scikit-learn API for XGBoost regression. __doc__ = """Implementation of the scikit-learn API for XGBoost regression.

View File

@ -10,7 +10,8 @@ import numpy as np
from .core import Booster, STRING_TYPES from .core import Booster, STRING_TYPES
def train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None, def train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None,
early_stopping_rounds=None, evals_result=None, verbose_eval=True): maximize=False, early_stopping_rounds=None, evals_result=None,
verbose_eval=True, learning_rates=None, xgb_model=None):
# pylint: disable=too-many-statements,too-many-branches, attribute-defined-outside-init # pylint: disable=too-many-statements,too-many-branches, attribute-defined-outside-init
"""Train a booster with given parameters. """Train a booster with given parameters.
@ -29,26 +30,83 @@ def train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None,
Customized objective function. Customized objective function.
feval : function feval : function
Customized evaluation function. Customized evaluation function.
maximize : bool
Whether to maximize feval.
early_stopping_rounds: int early_stopping_rounds: int
Activates early stopping. Validation error needs to decrease at least Activates early stopping. Validation error needs to decrease at least
every <early_stopping_rounds> round(s) to continue training. every <early_stopping_rounds> round(s) to continue training.
Requires at least one item in evals. Requires at least one item in evals.
If there's more than one, will use the last. If there's more than one, will use the last.
Returns the model from the last iteration (not the best one). Returns the model from the last iteration (not the best one).
If early stopping occurs, the model will have two additional fields: If early stopping occurs, the model will have three additional fields:
bst.best_score and bst.best_iteration. bst.best_score, bst.best_iteration and bst.best_ntree_limit.
(Use bst.best_ntree_limit to get the correct value if num_parallel_tree
and/or num_class appears in the parameters)
evals_result: dict evals_result: dict
This dictionary stores the evaluation results of all the items in watchlist This dictionary stores the evaluation results of all the items in watchlist.
verbose_eval : bool Example: with a watchlist containing [(dtest,'eval'), (dtrain,'train')] and
If `verbose_eval` then the evaluation metric on the validation set, if a parameter containing ('eval_metric', 'logloss')
given, is printed at each boosting stage. Returns: {'train': {'logloss': ['0.48253', '0.35953']},
'eval': {'logloss': ['0.480385', '0.357756']}}
verbose_eval : bool or int
Requires at least one item in evals.
If `verbose_eval` is True then the evaluation metric on the validation set is
printed at each boosting stage.
If `verbose_eval` is an integer then the evaluation metric on the validation set
is printed at every given `verbose_eval` boosting stage. The last boosting stage
/ the boosting stage found by using `early_stopping_rounds` is also printed.
Example: with verbose_eval=4 and at least one item in evals, an evaluation metric
is printed every 4 boosting stages, instead of every boosting stage.
learning_rates: list or function
List of learning rates for each boosting round
or a customized function that calculates eta in terms of the
current round number and the total number of boosting rounds (e.g. to
implement learning rate decay)
- list l: eta = l[boosting round]
- function f: eta = f(boosting round, num_boost_round)
xgb_model : file name of stored xgb model or 'Booster' instance
Xgb model to be loaded before training (allows training continuation).
Returns Returns
------- -------
booster : a trained booster model booster : a trained booster model
""" """
evals = list(evals) evals = list(evals)
if isinstance(params, dict) \
and 'eval_metric' in params \
and isinstance(params['eval_metric'], list):
params = dict((k, v) for k, v in params.items())
eval_metrics = params['eval_metric']
params.pop("eval_metric", None)
params = list(params.items())
for eval_metric in eval_metrics:
params += [('eval_metric', eval_metric)]
bst = Booster(params, [dtrain] + [d[0] for d in evals]) bst = Booster(params, [dtrain] + [d[0] for d in evals])
nboost = 0
num_parallel_tree = 1
if isinstance(verbose_eval, bool):
verbose_eval_every_line = False
else:
if isinstance(verbose_eval, int):
verbose_eval_every_line = verbose_eval
verbose_eval = True if verbose_eval_every_line > 0 else False
if xgb_model is not None:
if not isinstance(xgb_model, STRING_TYPES):
xgb_model = xgb_model.save_raw()
bst = Booster(params, [dtrain] + [d[0] for d in evals], model_file=xgb_model)
nboost = len(bst.get_dump())
else:
bst = Booster(params, [dtrain] + [d[0] for d in evals])
_params = dict(params) if isinstance(params, list) else params
if 'num_parallel_tree' in _params:
num_parallel_tree = _params['num_parallel_tree']
nboost //= num_parallel_tree
if 'num_class' in _params:
nboost //= _params['num_class']
if evals_result is not None: if evals_result is not None:
if not isinstance(evals_result, dict): if not isinstance(evals_result, dict):
@ -56,11 +114,12 @@ def train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None,
else: else:
evals_name = [d[1] for d in evals] evals_name = [d[1] for d in evals]
evals_result.clear() evals_result.clear()
evals_result.update({key: [] for key in evals_name}) evals_result.update(dict([(key, {}) for key in evals_name]))
if not early_stopping_rounds: if not early_stopping_rounds:
for i in range(num_boost_round): for i in range(num_boost_round):
bst.update(dtrain, i, obj) bst.update(dtrain, i, obj)
nboost += 1
if len(evals) != 0: if len(evals) != 0:
bst_eval_set = bst.eval_set(evals, i, feval) bst_eval_set = bst.eval_set(evals, i, feval)
if isinstance(bst_eval_set, STRING_TYPES): if isinstance(bst_eval_set, STRING_TYPES):
@ -69,11 +128,27 @@ def train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None,
msg = bst_eval_set.decode() msg = bst_eval_set.decode()
if verbose_eval: if verbose_eval:
sys.stderr.write(msg + '\n') if verbose_eval_every_line:
if i % verbose_eval_every_line == 0 or i == num_boost_round - 1:
sys.stderr.write(msg + '\n')
else:
sys.stderr.write(msg + '\n')
if evals_result is not None: if evals_result is not None:
res = re.findall(":-?([0-9.]+).", msg) res = re.findall("([0-9a-zA-Z@]+[-]*):-?([0-9.]+).", msg)
for key, val in zip(evals_name, res): for key in evals_name:
evals_result[key].append(val) evals_idx = evals_name.index(key)
res_per_eval = len(res) // len(evals_name)
for r in range(res_per_eval):
res_item = res[(evals_idx*res_per_eval) + r]
res_key = res_item[0]
res_val = res_item[1]
if res_key in evals_result[key]:
evals_result[key][res_key].append(res_val)
else:
evals_result[key][res_key] = [res_val]
bst.best_iteration = (nboost - 1)
bst.best_ntree_limit = nboost * num_parallel_tree
return bst return bst
else: else:
@ -81,15 +156,18 @@ def train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None,
if len(evals) < 1: if len(evals) < 1:
raise ValueError('For early stopping you need at least one set in evals.') raise ValueError('For early stopping you need at least one set in evals.')
sys.stderr.write("Will train until {} error hasn't decreased in {} rounds.\n".format(\ if verbose_eval:
sys.stderr.write("Will train until {} error hasn't decreased in {} rounds.\n".format(\
evals[-1][1], early_stopping_rounds)) evals[-1][1], early_stopping_rounds))
# is params a list of tuples? are we using multiple eval metrics? # is params a list of tuples? are we using multiple eval metrics?
if isinstance(params, list): if isinstance(params, list):
if len(params) != len(dict(params).items()): if len(params) != len(dict(params).items()):
raise ValueError('Check your params.'\ params = dict(params)
'Early stopping works with single eval metric only.') sys.stderr.write("Multiple eval metrics have been passed: " \
params = dict(params) "'{0}' will be used for early stopping.\n\n".format(params['eval_metric']))
else:
params = dict(params)
# either minimize loss or maximize AUC/MAP/NDCG # either minimize loss or maximize AUC/MAP/NDCG
maximize_score = False maximize_score = False
@ -97,6 +175,8 @@ def train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None,
maximize_metrics = ('auc', 'map', 'ndcg') maximize_metrics = ('auc', 'map', 'ndcg')
if any(params['eval_metric'].startswith(x) for x in maximize_metrics): if any(params['eval_metric'].startswith(x) for x in maximize_metrics):
maximize_score = True maximize_score = True
if feval is not None:
maximize_score = maximize
if maximize_score: if maximize_score:
best_score = 0.0 best_score = 0.0
@ -104,10 +184,19 @@ def train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None,
best_score = float('inf') best_score = float('inf')
best_msg = '' best_msg = ''
best_score_i = 0 best_score_i = (nboost - 1)
if isinstance(learning_rates, list) and len(learning_rates) != num_boost_round:
raise ValueError("Length of list 'learning_rates' has to equal 'num_boost_round'.")
for i in range(num_boost_round): for i in range(num_boost_round):
if learning_rates is not None:
if isinstance(learning_rates, list):
bst.set_param({'eta': learning_rates[i]})
else:
bst.set_param({'eta': learning_rates(i, num_boost_round)})
bst.update(dtrain, i, obj) bst.update(dtrain, i, obj)
nboost += 1
bst_eval_set = bst.eval_set(evals, i, feval) bst_eval_set = bst.eval_set(evals, i, feval)
if isinstance(bst_eval_set, STRING_TYPES): if isinstance(bst_eval_set, STRING_TYPES):
@ -116,26 +205,41 @@ def train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None,
msg = bst_eval_set.decode() msg = bst_eval_set.decode()
if verbose_eval: if verbose_eval:
sys.stderr.write(msg + '\n') if verbose_eval_every_line:
if i % verbose_eval_every_line == 0 or i == num_boost_round - 1:
sys.stderr.write(msg + '\n')
else:
sys.stderr.write(msg + '\n')
if evals_result is not None: if evals_result is not None:
res = re.findall(":-?([0-9.]+).", msg) res = re.findall("([0-9a-zA-Z@]+[-]*):-?([0-9.]+).", msg)
for key, val in zip(evals_name, res): for key in evals_name:
evals_result[key].append(val) evals_idx = evals_name.index(key)
res_per_eval = len(res) // len(evals_name)
for r in range(res_per_eval):
res_item = res[(evals_idx*res_per_eval) + r]
res_key = res_item[0]
res_val = res_item[1]
if res_key in evals_result[key]:
evals_result[key][res_key].append(res_val)
else:
evals_result[key][res_key] = [res_val]
score = float(msg.rsplit(':', 1)[1]) score = float(msg.rsplit(':', 1)[1])
if (maximize_score and score > best_score) or \ if (maximize_score and score > best_score) or \
(not maximize_score and score < best_score): (not maximize_score and score < best_score):
best_score = score best_score = score
best_score_i = i best_score_i = (nboost - 1)
best_msg = msg best_msg = msg
elif i - best_score_i >= early_stopping_rounds: elif i - best_score_i >= early_stopping_rounds:
sys.stderr.write("Stopping. Best iteration:\n{}\n\n".format(best_msg)) if verbose_eval:
sys.stderr.write("Stopping. Best iteration:\n{}\n\n".format(best_msg))
bst.best_score = best_score bst.best_score = best_score
bst.best_iteration = best_score_i bst.best_iteration = best_score_i
break break
bst.best_score = best_score bst.best_score = best_score
bst.best_iteration = best_score_i bst.best_iteration = best_score_i
bst.best_ntree_limit = (bst.best_iteration + 1) * num_parallel_tree
return bst return bst
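A short sketch of consuming the new best_ntree_limit attribute at prediction time after early stopping (synthetic data; the metric and parameters are arbitrary):

import numpy as np
import xgboost as xgb

dtrain = xgb.DMatrix(np.random.rand(200, 5), label=np.random.randint(2, size=200))
dtest = xgb.DMatrix(np.random.rand(100, 5), label=np.random.randint(2, size=100))

bst = xgb.train({'objective': 'binary:logistic', 'eval_metric': 'logloss'},
                dtrain, num_boost_round=200,
                evals=[(dtest, 'eval')], early_stopping_rounds=10)

# best_ntree_limit already accounts for num_parallel_tree and num_class
preds = bst.predict(dtest, ntree_limit=bst.best_ntree_limit)
print(bst.best_score, bst.best_iteration, bst.best_ntree_limit)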
@ -179,11 +283,14 @@ def mknfold(dall, nfold, param, seed, evals=(), fpreproc=None):
ret.append(CVPack(dtrain, dtest, plst)) ret.append(CVPack(dtrain, dtest, plst))
return ret return ret
def aggcv(rlist, show_stdv=True, show_progress=None, as_pandas=True, trial=0):
def aggcv(rlist, show_stdv=True, show_progress=None, as_pandas=True):
# pylint: disable=invalid-name # pylint: disable=invalid-name
""" """
Aggregate cross-validation results. Aggregate cross-validation results.
If show_progress is true, progress is displayed in every call. If
show_progress is an integer, progress will only be displayed every
`show_progress` trees, tracked via trial.
""" """
cvmap = {} cvmap = {}
idx = rlist[0].split()[0] idx = rlist[0].split()[0]
@ -217,8 +324,6 @@ def aggcv(rlist, show_stdv=True, show_progress=None, as_pandas=True):
index.extend([k + '-mean', k + '-std']) index.extend([k + '-mean', k + '-std'])
results.extend([mean, std]) results.extend([mean, std])
if as_pandas: if as_pandas:
try: try:
import pandas as pd import pandas as pd
@ -232,15 +337,16 @@ def aggcv(rlist, show_stdv=True, show_progress=None, as_pandas=True):
if show_progress is None: if show_progress is None:
show_progress = True show_progress = True
if show_progress: if (isinstance(show_progress, int) and trial % show_progress == 0) or (isinstance(show_progress, bool) and show_progress):
sys.stderr.write(msg + '\n') sys.stderr.write(msg + '\n')
sys.stderr.flush()
return results return results
def cv(params, dtrain, num_boost_round=10, nfold=3, metrics=(), def cv(params, dtrain, num_boost_round=10, nfold=3, metrics=(),
obj=None, feval=None, fpreproc=None, as_pandas=True, obj=None, feval=None, maximize=False, early_stopping_rounds=None,
show_progress=None, show_stdv=True, seed=0): fpreproc=None, as_pandas=True, show_progress=None, show_stdv=True, seed=0):
# pylint: disable = invalid-name # pylint: disable = invalid-name
"""Cross-validation with given paramaters. """Cross-validation with given paramaters.
@ -260,15 +366,23 @@ def cv(params, dtrain, num_boost_round=10, nfold=3, metrics=(),
Custom objective function. Custom objective function.
feval : function feval : function
Custom evaluation function. Custom evaluation function.
maximize : bool
Whether to maximize feval.
early_stopping_rounds: int
Activates early stopping. CV error needs to decrease at least
every <early_stopping_rounds> round(s) to continue.
Last entry in evaluation history is the one from best iteration.
fpreproc : function fpreproc : function
Preprocessing function that takes (dtrain, dtest, param) and returns Preprocessing function that takes (dtrain, dtest, param) and returns
transformed versions of those. transformed versions of those.
as_pandas : bool, default True as_pandas : bool, default True
Return pd.DataFrame when pandas is installed. Return pd.DataFrame when pandas is installed.
If False or pandas is not installed, return np.ndarray If False or pandas is not installed, return np.ndarray
show_progress : bool or None, default None show_progress : bool, int, or None, default None
Whether to display the progress. If None, progress will be displayed Whether to display the progress. If None, progress will be displayed
when np.ndarray is returned. when np.ndarray is returned. If True, progress will be displayed at
boosting stage. If an integer is given, progress will be displayed
at every given `show_progress` boosting stage.
show_stdv : bool, default True show_stdv : bool, default True
Whether to display the standard deviation in progress. Whether to display the standard deviation in progress.
Results are not affected, and always contains std. Results are not affected, and always contains std.
@ -279,6 +393,28 @@ def cv(params, dtrain, num_boost_round=10, nfold=3, metrics=(),
------- -------
evaluation history : list(string) evaluation history : list(string)
""" """
if early_stopping_rounds is not None:
if len(metrics) > 1:
raise ValueError('Check your params.'\
'Early stopping works with single eval metric only.')
sys.stderr.write("Will train until cv error hasn't decreased in {} rounds.\n".format(\
early_stopping_rounds))
maximize_score = False
if len(metrics) == 1:
maximize_metrics = ('auc', 'map', 'ndcg')
if any(metrics[0].startswith(x) for x in maximize_metrics):
maximize_score = True
if feval is not None:
maximize_score = maximize
if maximize_score:
best_score = 0.0
else:
best_score = float('inf')
best_score_i = 0
results = [] results = []
cvfolds = mknfold(dtrain, nfold, params, seed, metrics, fpreproc) cvfolds = mknfold(dtrain, nfold, params, seed, metrics, fpreproc)
for i in range(num_boost_round): for i in range(num_boost_round):
@ -286,9 +422,20 @@ def cv(params, dtrain, num_boost_round=10, nfold=3, metrics=(),
fold.update(i, obj) fold.update(i, obj)
res = aggcv([f.eval(i, feval) for f in cvfolds], res = aggcv([f.eval(i, feval) for f in cvfolds],
show_stdv=show_stdv, show_progress=show_progress, show_stdv=show_stdv, show_progress=show_progress,
as_pandas=as_pandas) as_pandas=as_pandas, trial=i)
results.append(res) results.append(res)
if early_stopping_rounds is not None:
score = res[0]
if (maximize_score and score > best_score) or \
(not maximize_score and score < best_score):
best_score = score
best_score_i = i
elif i - best_score_i >= early_stopping_rounds:
sys.stderr.write("Stopping. Best iteration: {}\n".format(best_score_i))
results = results[:best_score_i+1]
break
if as_pandas: if as_pandas:
try: try:
import pandas as pd import pandas as pd
@ -299,4 +446,3 @@ def cv(params, dtrain, num_boost_round=10, nfold=3, metrics=(),
results = np.array(results) results = np.array(results)
return results return results
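An illustrative cv() call using the new early stopping and progress options documented above (synthetic data, arbitrary values):

import numpy as np
import xgboost as xgb

dtrain = xgb.DMatrix(np.random.rand(300, 5), label=np.random.randint(2, size=300))

history = xgb.cv({'objective': 'binary:logistic'}, dtrain,
                 num_boost_round=100, nfold=5, metrics=('error',),
                 early_stopping_rounds=10,   # stop when the CV error stops improving
                 show_progress=10,           # report every 10th round
                 seed=0)
print(len(history))   # history is truncated at the best iteration when stopping early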

View File

@ -64,7 +64,7 @@ if [ ${TASK} == "python-package" -o ${TASK} == "python-package3" ]; then
conda create -n myenv python=2.7 conda create -n myenv python=2.7
fi fi
source activate myenv source activate myenv
conda install numpy scipy pandas matplotlib nose conda install numpy scipy pandas matplotlib nose scikit-learn
python -m pip install graphviz python -m pip install graphviz
make all CXX=${CXX} || exit -1 make all CXX=${CXX} || exit -1

View File

@ -14,7 +14,7 @@
namespace xgboost { namespace xgboost {
/*! /*!
* \brief unsigned interger type used in boost, * \brief unsigned integer type used in boost,
* used for feature index and row index * used for feature index and row index
*/ */
typedef unsigned bst_uint; typedef unsigned bst_uint;
@ -35,8 +35,8 @@ struct bst_gpair {
}; };
/*! /*!
* \brief extra information that might needed by gbm and tree module * \brief extra information that might be needed by gbm and tree module
* these information are not necessarily presented, and can be empty * this information is not necessarily present, and can be empty
*/ */
struct BoosterInfo { struct BoosterInfo {
/*! \brief number of rows in the data */ /*! \brief number of rows in the data */
@ -53,7 +53,7 @@ struct BoosterInfo {
/*! \brief number of rows, number of columns */ /*! \brief number of rows, number of columns */
BoosterInfo(void) : num_row(0), num_col(0) { BoosterInfo(void) : num_row(0), num_col(0) {
} }
/*! \brief get root of ith instance */ /*! \brief get root of i-th instance */
inline unsigned GetRoot(size_t i) const { inline unsigned GetRoot(size_t i) const {
return root_index.size() == 0 ? 0 : root_index[i]; return root_index.size() == 0 ? 0 : root_index[i];
} }
@ -120,13 +120,13 @@ struct ColBatch : public SparseBatch {
}; };
/** /**
* \brief interface of feature matrix, needed for tree construction * \brief interface of feature matrix, needed for tree construction
* this interface defines two way to access features, * this interface defines two ways to access features:
* row access is defined by iterator of RowBatch * row access is defined by iterator of RowBatch
* col access is optional, checked by HaveColAccess, and defined by iterator of ColBatch * col access is optional, checked by HaveColAccess, and defined by iterator of ColBatch
*/ */
class IFMatrix { class IFMatrix {
public: public:
// the interface only need to ganrantee row iter // the interface only need to guarantee row iter
// column iter is active, when ColIterator is called, row_iter can be disabled // column iter is active, when ColIterator is called, row_iter can be disabled
/*! \brief get the row iterator associated with FMatrix */ /*! \brief get the row iterator associated with FMatrix */
virtual utils::IIterator<RowBatch> *RowIterator(void) = 0; virtual utils::IIterator<RowBatch> *RowIterator(void) = 0;
@ -142,7 +142,7 @@ class IFMatrix {
* \brief check if column access is supported, if not, initialize column access * \brief check if column access is supported, if not, initialize column access
* \param enabled whether certain feature should be included in column access * \param enabled whether certain feature should be included in column access
* \param subsample subsample ratio when generating column access * \param subsample subsample ratio when generating column access
* \param max_row_perbatch auxilary information, maximum row used in each column batch * \param max_row_perbatch auxiliary information, maximum row used in each column batch
* this is a hint information that can be ignored by the implementation * this is a hint information that can be ignored by the implementation
*/ */
virtual void InitColAccess(const std::vector<bool> &enabled, virtual void InitColAccess(const std::vector<bool> &enabled,

View File

@ -58,7 +58,7 @@ class IGradBooster {
return false; return false;
} }
/*! /*!
* \brief peform update to the model(boosting) * \brief perform update to the model(boosting)
* \param p_fmat feature matrix that provide access to features * \param p_fmat feature matrix that provide access to features
* \param buffer_offset buffer index offset of these instances, if equals -1 * \param buffer_offset buffer index offset of these instances, if equals -1
* this means we do not have buffer index allocated to the gbm * this means we do not have buffer index allocated to the gbm
@ -88,7 +88,7 @@ class IGradBooster {
std::vector<float> *out_preds, std::vector<float> *out_preds,
unsigned ntree_limit = 0) = 0; unsigned ntree_limit = 0) = 0;
/*! /*!
* \brief online prediction funciton, predict score for one instance at a time * \brief online prediction function, predict score for one instance at a time
* NOTE: use the batch prediction interface if possible, batch prediction is usually * NOTE: use the batch prediction interface if possible, batch prediction is usually
* more efficient than online prediction * more efficient than online prediction
* This function is NOT threadsafe, make sure you only call from one thread * This function is NOT threadsafe, make sure you only call from one thread
@ -119,7 +119,7 @@ class IGradBooster {
/*! /*!
* \brief dump the model in text format * \brief dump the model in text format
* \param fmap feature map that may help give interpretations of feature * \param fmap feature map that may help give interpretations of feature
* \param option extra option of the dumo model * \param option extra option of the dump model
* \return a vector of dump for boosters * \return a vector of dump for boosters
*/ */
virtual std::vector<std::string> DumpModel(const utils::FeatMap& fmap, int option) = 0; virtual std::vector<std::string> DumpModel(const utils::FeatMap& fmap, int option) = 0;

View File

@ -31,7 +31,7 @@ class GBTree : public IGradBooster {
using namespace std; using namespace std;
if (!strncmp(name, "bst:", 4)) { if (!strncmp(name, "bst:", 4)) {
cfg.push_back(std::make_pair(std::string(name+4), std::string(val))); cfg.push_back(std::make_pair(std::string(name+4), std::string(val)));
// set into updaters, if already intialized // set into updaters, if already initialized
for (size_t i = 0; i < updaters.size(); ++i) { for (size_t i = 0; i < updaters.size(); ++i) {
updaters[i]->SetParam(name+4, val); updaters[i]->SetParam(name+4, val);
} }
@ -85,7 +85,7 @@ class GBTree : public IGradBooster {
fo.Write(BeginPtr(pred_counter), pred_counter.size() * sizeof(unsigned)); fo.Write(BeginPtr(pred_counter), pred_counter.size() * sizeof(unsigned));
} }
} }
// initialize the predic buffer // initialize the predict buffer
virtual void InitModel(void) { virtual void InitModel(void) {
pred_buffer.clear(); pred_counter.clear(); pred_buffer.clear(); pred_counter.clear();
pred_buffer.resize(mparam.PredBufferSize(), 0.0f); pred_buffer.resize(mparam.PredBufferSize(), 0.0f);
@ -138,10 +138,7 @@ class GBTree : public IGradBooster {
{ {
nthread = omp_get_num_threads(); nthread = omp_get_num_threads();
} }
thread_temp.resize(nthread, tree::RegTree::FVec()); InitThreadTemp(nthread);
for (int i = 0; i < nthread; ++i) {
thread_temp[i].Init(mparam.num_feature);
}
std::vector<float> &preds = *out_preds; std::vector<float> &preds = *out_preds;
const size_t stride = info.num_row * mparam.num_output_group; const size_t stride = info.num_row * mparam.num_output_group;
preds.resize(stride * (mparam.size_leaf_vector+1)); preds.resize(stride * (mparam.size_leaf_vector+1));
@ -194,10 +191,7 @@ class GBTree : public IGradBooster {
{ {
nthread = omp_get_num_threads(); nthread = omp_get_num_threads();
} }
thread_temp.resize(nthread, tree::RegTree::FVec()); InitThreadTemp(nthread);
for (int i = 0; i < nthread; ++i) {
thread_temp[i].Init(mparam.num_feature);
}
this->PredPath(p_fmat, info, out_preds, ntree_limit); this->PredPath(p_fmat, info, out_preds, ntree_limit);
} }
virtual std::vector<std::string> DumpModel(const utils::FeatMap& fmap, int option) { virtual std::vector<std::string> DumpModel(const utils::FeatMap& fmap, int option) {
@ -391,6 +385,16 @@ class GBTree : public IGradBooster {
} }
} }
} }
// init thread buffers
inline void InitThreadTemp(int nthread) {
int prev_thread_temp_size = thread_temp.size();
if (prev_thread_temp_size < nthread) {
thread_temp.resize(nthread, tree::RegTree::FVec());
for (int i = prev_thread_temp_size; i < nthread; ++i) {
thread_temp[i].Init(mparam.num_feature);
}
}
}
// --- data structure --- // --- data structure ---
/*! \brief training parameters */ /*! \brief training parameters */
@ -442,7 +446,7 @@ class GBTree : public IGradBooster {
int num_roots; int num_roots;
/*! \brief number of features to be used by trees */ /*! \brief number of features to be used by trees */
int num_feature; int num_feature;
/*! \brief size of predicton buffer allocated used for buffering */ /*! \brief size of prediction buffer allocated used for buffering */
int64_t num_pbuffer; int64_t num_pbuffer;
/*! /*!
* \brief how many output group a single instance can produce * \brief how many output group a single instance can produce

View File

@ -22,7 +22,7 @@ typedef learner::DMatrix DataMatrix;
* \param silent whether print message during loading * \param silent whether print message during loading
* \param savebuffer whether temporal buffer the file if the file is in text format * \param savebuffer whether temporal buffer the file if the file is in text format
* \param loadsplit whether we only load a split of input files * \param loadsplit whether we only load a split of input files
* such that each worker node get a split of the data * such that each worker node get a split of the data
* \param cache_file name of cache_file, used by external memory version * \param cache_file name of cache_file, used by external memory version
* can be NULL, if cache_file is specified, this will be the temporal * can be NULL, if cache_file is specified, this will be the temporal
* space that can be re-used to store intermediate data * space that can be re-used to store intermediate data
@ -38,7 +38,7 @@ DataMatrix* LoadDataMatrix(const char *fname,
* note: the saved dmatrix format may not be in exactly same as input * note: the saved dmatrix format may not be in exactly same as input
* SaveDMatrix will choose the best way to materialize the dmatrix. * SaveDMatrix will choose the best way to materialize the dmatrix.
* \param dmat the dmatrix to be saved * \param dmat the dmatrix to be saved
* \param fname file name to be savd * \param fname file name to be saved
* \param silent whether print message during saving * \param silent whether print message during saving
*/ */
void SaveDataMatrix(const DataMatrix &dmat, const char *fname, bool silent = false); void SaveDataMatrix(const DataMatrix &dmat, const char *fname, bool silent = false);

View File

@ -31,7 +31,7 @@ struct LibSVMPage : public SparsePage {
/*! /*!
* \brief libsvm parser that parses the input lines * \brief libsvm parser that parses the input lines
* and returns rows in input data * and returns rows in input data
* factry that was used by threadbuffer template * factory that was used by threadbuffer template
*/ */
class LibSVMPageFactory { class LibSVMPageFactory {
public: public:

Some files were not shown because too many files have changed in this diff.