[R-package] GPL2 dependency reduction and some fixes (#1401)

* [R] do not remove zero coefficients from gblinear dump

* [R] switch from stringr to stringi

* fix #1399

* [R] separate ggplot backend, add base r graphics, cleanup, more plots, tests

* add missing include in amalgamation - fixes building R package in linux

* add forgotten file

* [R] fix DESCRIPTION

* [R] fix travis check issue and some cleanup
This commit is contained in:
Vadim Khotilovich
2016-07-27 02:05:04 -05:00
committed by Tong He
parent f6423056c0
commit d5c143367d
19 changed files with 548 additions and 312 deletions

View File

@@ -1,15 +0,0 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.plot.deepness.R
\name{edge.parser}
\alias{edge.parser}
\title{Parse the graph to extract vector of edges}
\usage{
edge.parser(element)
}
\arguments{
\item{element}{igraph object containing the path from the root to the leaf.}
}
\description{
Parse the graph to extract vector of edges
}

View File

@@ -1,15 +0,0 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.plot.deepness.R
\name{get.paths.to.leaf}
\alias{get.paths.to.leaf}
\title{Extract path from root to leaf from data.table}
\usage{
get.paths.to.leaf(dt_tree)
}
\arguments{
\item{dt_tree}{data.table containing the nodes and edges of the trees}
}
\description{
Extract path from root to leaf from data.table
}

View File

@@ -1,17 +0,0 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.plot.deepness.R
\name{multiplot}
\alias{multiplot}
\title{Plot multiple graphs at the same time}
\usage{
multiplot(..., cols = 1)
}
\arguments{
\item{...}{the plots}
\item{cols}{number of columns}
}
\description{
Plot multiple graph aligned by rows and columns.
}

View File

@@ -1,46 +1,74 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.plot.deepness.R
\name{xgb.plot.deepness}
% Please edit documentation in R/xgb.ggplot.R, R/xgb.plot.deepness.R
\name{xgb.ggplot.deepness}
\alias{xgb.ggplot.deepness}
\alias{xgb.plot.deepness}
\title{Plot model trees deepness}
\usage{
xgb.plot.deepness(model = NULL)
xgb.ggplot.deepness(model = NULL, which = c("2x1", "max.depth", "med.depth",
"med.weight"))
xgb.plot.deepness(model = NULL, which = c("2x1", "max.depth", "med.depth",
"med.weight"), plot = TRUE, ...)
}
\arguments{
\item{model}{dump generated by the \code{xgb.train} function.}
\item{model}{either an \code{xgb.Booster} model generated by the \code{xgb.train} function
or a data.table result of the \code{xgb.model.dt.tree} function.}
\item{which}{which distribution to plot (see details).}
\item{plot}{(base R barplot) whether a barplot should be produced.
If FALSE, only a data.table is returned.}
\item{...}{other parameters passed to \code{barplot} or \code{plot}.}
}
\value{
Two graphs showing the distribution of the model deepness.
Other than producing plots (when \code{plot=TRUE}), the \code{xgb.plot.deepness} function
silently returns a processed data.table where each row corresponds to a terminal leaf in a tree model,
and contains information about leaf's depth, cover, and weight (which is used in calculating predictions).
The \code{xgb.ggplot.deepness} silently returns either a list of two ggplot graphs when \code{which="2x1"}
or a single ggplot graph for the other \code{which} options.
}
\description{
Generate a graph to plot the distribution of deepness among trees.
Visualizes distributions related to depth of tree leafs.
\code{xgb.plot.deepness} uses base R graphics, while \code{xgb.ggplot.deepness} uses the ggplot backend.
}
\details{
Display both the number of \code{leaf} and the distribution of \code{weighted observations}
by tree deepness level.
The purpose of this function is to help the user to find the best trade-off to set
the \code{max_depth} and \code{min_child_weight} parameters according to the bias / variance trade-off.
See \link{xgb.train} for more information about these parameters.
The graph is made of two parts:
When \code{which="2x1"}, two distributions with respect to the leaf depth
are plotted on top of each other:
\itemize{
\item Count: number of leaf per level of deepness;
\item Weighted cover: noramlized weighted cover per leaf (weighted number of instances).
\item the distribution of the number of leafs in a tree model at a certain depth;
\item the distribution of average weighted number of observations ("cover")
ending up in leafs at certain depth.
}
Those could be helpful in determining sensible ranges of the \code{max_depth}
and \code{min_child_weight} parameters.
This function is inspired by the blog post \url{http://aysent.github.io/2015/11/08/random-forest-leaf-visualization.html}
When \code{which="max.depth"} or \code{which="med.depth"}, plots of either maximum or median depth
per tree with respect to tree number are created. And \code{which="med.weight"} allows to see how
a tree's median absolute leaf weight changes through the iterations.
This function was inspired by the blog post
\url{http://aysent.github.io/2015/11/08/random-forest-leaf-visualization.html}.
}
\examples{
data(agaricus.train, package='xgboost')
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 15,
eta = 1, nthread = 2, nrounds = 30, objective = "binary:logistic",
min_child_weight = 50)
eta = 0.1, nthread = 2, nrounds = 50, objective = "binary:logistic",
subsample = 0.5, min_child_weight = 2)
xgb.plot.deepness(model = bst)
xgb.plot.deepness(bst)
xgb.ggplot.deepness(bst)
xgb.plot.deepness(bst, which='max.depth', pch=16, col=rgb(0,0,1,0.3), cex=2)
xgb.plot.deepness(bst, which='med.weight', pch=16, col=rgb(0,0,1,0.3), cex=2)
}
\seealso{
\code{\link{xgb.train}}, \code{\link{xgb.model.dt.tree}}.
}

View File

@@ -1,41 +1,82 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.plot.importance.R
\name{xgb.plot.importance}
% Please edit documentation in R/xgb.ggplot.R, R/xgb.plot.importance.R
\name{xgb.ggplot.importance}
\alias{xgb.ggplot.importance}
\alias{xgb.plot.importance}
\title{Plot feature importance bar graph}
\title{Plot feature importance as a bar graph}
\usage{
xgb.plot.importance(importance_matrix = NULL, n_clusters = c(1:10), ...)
xgb.ggplot.importance(importance_matrix = NULL, top_n = NULL,
measure = NULL, rel_to_first = FALSE, n_clusters = c(1:10), ...)
xgb.plot.importance(importance_matrix = NULL, top_n = NULL,
measure = NULL, rel_to_first = FALSE, left_margin = 10, cex = NULL,
plot = TRUE, ...)
}
\arguments{
\item{importance_matrix}{a \code{data.table} returned by the \code{xgb.importance} function.}
\item{importance_matrix}{a \code{data.table} returned by \code{\link{xgb.importance}}.}
\item{n_clusters}{a \code{numeric} vector containing the min and the max range of the possible number of clusters of bars.}
\item{top_n}{maximal number of top features to include into the plot.}
\item{...}{currently not used}
\item{measure}{the name of importance measure to plot.
When \code{NULL}, 'Gain' would be used for trees and 'Weight' would be used for gblinear.}
\item{rel_to_first}{whether importance values should be represented as relative to the highest ranked feature.
See Details.}
\item{n_clusters}{(ggplot only) a \code{numeric} vector containing the min and the max range
of the possible number of clusters of bars.}
\item{...}{other parameters passed to \code{barplot} (except horiz, border, cex.names, names.arg, and las).}
\item{left_margin}{(base R barplot) allows to adjust the left margin size to fit feature names.
When it is NULL, the existing \code{par('mar')} is used.}
\item{cex}{(base R barplot) passed as \code{cex.names} parameter to \code{barplot}.}
\item{plot}{(base R barplot) whether a barplot should be produced.
If FALSE, only a data.table is returned.}
}
\value{
A \code{ggplot2} bar graph representing each feature by a horizontal bar. Longer is the bar, more important is the feature. Features are classified by importance and clustered by importance. The group is represented through the color of the bar.
The \code{xgb.plot.importance} function creates a \code{barplot} (when \code{plot=TRUE})
and silently returns a processed data.table with \code{n_top} features sorted by importance.
The \code{xgb.ggplot.importance} function returns a ggplot graph which could be customized afterwards.
E.g., to change the title of the graph, add \code{+ ggtitle("A GRAPH NAME")} to the result.
}
\description{
Read a data.table containing feature importance details and plot it (for both GLM and Trees).
Represents previously calculated feature importance as a bar graph.
\code{xgb.plot.importance} uses base R graphics, while \code{xgb.ggplot.importance} uses the ggplot backend.
}
\details{
The purpose of this function is to easily represent the importance of each feature of a model.
The function returns a ggplot graph, therefore each of its characteristic can be overriden (to customize it).
In particular you may want to override the title of the graph. To do so, add \code{+ ggtitle("A GRAPH NAME")} next to the value returned by this function.
The graph represents each feature as a horizontal bar of length proportional to the importance of a feature.
Features are shown ranked in a decreasing importance order.
It works for importances from both \code{gblinear} and \code{gbtree} models.
When \code{rel_to_first = FALSE}, the values would be plotted as they were in \code{importance_matrix}.
For gbtree model, that would mean being normalized to the total of 1
("what is feature's importance contribution relative to the whole model?").
For linear models, \code{rel_to_first = FALSE} would show actual values of the coefficients.
Setting \code{rel_to_first = TRUE} allows to see the picture from the perspective of
"what is feature's importance contribution relative to the most important feature?"
The ggplot-backend method also performs 1-D custering of the importance values,
with bar colors coresponding to different clusters that have somewhat similar importance values.
}
\examples{
data(agaricus.train, package='xgboost')
data(agaricus.train)
#Both dataset are list with two items, a sparse matrix and labels
#(labels = outcome column which will be learned).
#Each column of the sparse Matrix is a feature in one hot encoding format.
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2,
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 3,
eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
importance_matrix <- xgb.importance(colnames(agaricus.train$data), model = bst)
xgb.plot.importance(importance_matrix)
xgb.plot.importance(importance_matrix, rel_to_first = TRUE, xlab = "Relative importance")
(gg <- xgb.ggplot.importance(importance_matrix, measure = "Frequency", rel_to_first = TRUE))
gg + ggplot2::ylab("Frequency")
}
\seealso{
\code{\link[graphics]{barplot}}.
}