[R-package] GPL2 dependency reduction and some fixes (#1401)

* [R] do not remove zero coefficients from gblinear dump
* [R] switch from stringr to stringi
* fix #1399
* [R] separate ggplot backend, add base r graphics, cleanup, more plots, tests
* add missing include in amalgamation - fixes building R package in linux
* add forgotten file
* [R] fix DESCRIPTION
* [R] fix travis check issue and some cleanup
2016-07-27 02:05:04 -05:00
parent f6423056c0
commit d5c143367d
19 changed files with 548 additions and 312 deletions
--- a/R-package/man/edge.parser.Rd
+++ b/R-package/man/edge.parser.Rd
@@ -1,15 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/xgb.plot.deepness.R
-\name{edge.parser}
-\alias{edge.parser}
-\title{Parse the graph to extract vector of edges}
-\usage{
-edge.parser(element)
-}
-\arguments{
-\item{element}{igraph object containing the path from the root to the leaf.}
-}
-\description{
-Parse the graph to extract vector of edges
-}
-
--- a/R-package/man/get.paths.to.leaf.Rd
+++ b/R-package/man/get.paths.to.leaf.Rd
@@ -1,15 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/xgb.plot.deepness.R
-\name{get.paths.to.leaf}
-\alias{get.paths.to.leaf}
-\title{Extract path from root to leaf from data.table}
-\usage{
-get.paths.to.leaf(dt_tree)
-}
-\arguments{
-\item{dt_tree}{data.table containing the nodes and edges of the trees}
-}
-\description{
-Extract path from root to leaf from data.table
-}
-
--- a/R-package/man/multiplot.Rd
+++ b/R-package/man/multiplot.Rd
@@ -1,17 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/xgb.plot.deepness.R
-\name{multiplot}
-\alias{multiplot}
-\title{Plot multiple graphs at the same time}
-\usage{
-multiplot(..., cols = 1)
-}
-\arguments{
-\item{...}{the plots}
-
-\item{cols}{number of columns}
-}
-\description{
-Plot multiple graph aligned by rows and columns.
-}
-
--- a/R-package/man/xgb.plot.deepness.Rd
+++ b/R-package/man/xgb.plot.deepness.Rd
@@ -1,46 +1,74 @@
 % Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/xgb.plot.deepness.R
-\name{xgb.plot.deepness}
+% Please edit documentation in R/xgb.ggplot.R, R/xgb.plot.deepness.R
+\name{xgb.ggplot.deepness}
+\alias{xgb.ggplot.deepness}
 \alias{xgb.plot.deepness}
 \title{Plot model trees deepness}
 \usage{
-xgb.plot.deepness(model = NULL)
+xgb.ggplot.deepness(model = NULL, which = c("2x1", "max.depth", "med.depth",
+  "med.weight"))
+
+xgb.plot.deepness(model = NULL, which = c("2x1", "max.depth", "med.depth",
+  "med.weight"), plot = TRUE, ...)
 }
 \arguments{
-\item{model}{dump generated by the \code{xgb.train} function.}
+\item{model}{either an \code{xgb.Booster} model generated by the \code{xgb.train} function
+or a data.table result of the \code{xgb.model.dt.tree} function.}
+
+\item{which}{which distribution to plot (see details).}
+
+\item{plot}{(base R barplot) whether a barplot should be produced. 
+If FALSE, only a data.table is returned.}
+
+\item{...}{other parameters passed to \code{barplot} or \code{plot}.}
 }
 \value{
-Two graphs showing the distribution of the model deepness.
+Other than producing plots (when \code{plot=TRUE}), the \code{xgb.plot.deepness} function
+silently returns a processed data.table where each row corresponds to a terminal leaf in a tree model,
+and contains information about the leaf's depth, cover, and weight (which is used in calculating predictions).
+
+The \code{xgb.ggplot.deepness} function silently returns either a list of two ggplot graphs when \code{which="2x1"}
+or a single ggplot graph for the other \code{which} options.
 }
 \description{
-Generate a graph to plot the distribution of deepness among trees.
+Visualizes distributions related to the depth of tree leaves.
+\code{xgb.plot.deepness} uses base R graphics, while \code{xgb.ggplot.deepness} uses the ggplot backend.
 }
 \details{
-Display both the number of \code{leaf} and the distribution of \code{weighted observations}
-by tree deepness level.
-
-The purpose of this function is to help the user to find the best trade-off to set
-the \code{max_depth} and \code{min_child_weight} parameters according to the bias / variance trade-off.
-
-See \link{xgb.train} for more information about these parameters.
-
-The graph is made of two parts:
-
+When \code{which="2x1"}, two distributions with respect to the leaf depth
+are plotted on top of each other:
 \itemize{
- \item Count: number of leaf per level of deepness;
- \item Weighted cover: noramlized weighted cover per leaf (weighted number of instances).
+ \item the distribution of the number of leaves in a tree model at a certain depth;
+ \item the distribution of the average weighted number of observations ("cover")
+       ending up in leaves at a certain depth.
 }
+Those could be helpful in determining sensible ranges of the \code{max_depth} 
+and \code{min_child_weight} parameters.

-This function is inspired by the blog post \url{http://aysent.github.io/2015/11/08/random-forest-leaf-visualization.html}
+When \code{which="max.depth"} or \code{which="med.depth"}, the maximum or median depth
+of each tree is plotted against the tree number. Setting \code{which="med.weight"} shows how
+a tree's median absolute leaf weight changes through the iterations.
+
+This function was inspired by the blog post
+\url{http://aysent.github.io/2015/11/08/random-forest-leaf-visualization.html}.
 }
 \examples{
+
 data(agaricus.train, package='xgboost')

 bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 15,
-                 eta = 1, nthread = 2, nrounds = 30, objective = "binary:logistic",
-                 min_child_weight = 50)
+               eta = 0.1, nthread = 2, nrounds = 50, objective = "binary:logistic",
+               subsample = 0.5, min_child_weight = 2)

-xgb.plot.deepness(model = bst)
+xgb.plot.deepness(bst)
+xgb.ggplot.deepness(bst)
+
+xgb.plot.deepness(bst, which='max.depth', pch=16, col=rgb(0,0,1,0.3), cex=2)
+
+xgb.plot.deepness(bst, which='med.weight', pch=16, col=rgb(0,0,1,0.3), cex=2)

 }
+\seealso{
+\code{\link{xgb.train}}, \code{\link{xgb.model.dt.tree}}.
+}

--- a/R-package/man/xgb.plot.importance.Rd
+++ b/R-package/man/xgb.plot.importance.Rd
@@ -1,41 +1,82 @@
 % Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/xgb.plot.importance.R
-\name{xgb.plot.importance}
+% Please edit documentation in R/xgb.ggplot.R, R/xgb.plot.importance.R
+\name{xgb.ggplot.importance}
+\alias{xgb.ggplot.importance}
 \alias{xgb.plot.importance}
-\title{Plot feature importance bar graph}
+\title{Plot feature importance as a bar graph}
 \usage{
-xgb.plot.importance(importance_matrix = NULL, n_clusters = c(1:10), ...)
+xgb.ggplot.importance(importance_matrix = NULL, top_n = NULL,
+  measure = NULL, rel_to_first = FALSE, n_clusters = c(1:10), ...)
+
+xgb.plot.importance(importance_matrix = NULL, top_n = NULL,
+  measure = NULL, rel_to_first = FALSE, left_margin = 10, cex = NULL,
+  plot = TRUE, ...)
 }
 \arguments{
-\item{importance_matrix}{a \code{data.table} returned by the \code{xgb.importance} function.}
+\item{importance_matrix}{a \code{data.table} returned by \code{\link{xgb.importance}}.}

-\item{n_clusters}{a \code{numeric} vector containing the min and the max range of the possible number of clusters of bars.}
+\item{top_n}{maximal number of top features to include in the plot.}

-\item{...}{currently not used}
+\item{measure}{the name of the importance measure to plot.
+When \code{NULL}, 'Gain' is used for trees and 'Weight' is used for gblinear.}
+
+\item{rel_to_first}{whether importance values should be represented as relative to the highest ranked feature.
+See Details.}
+
+\item{n_clusters}{(ggplot only) a \code{numeric} vector containing the min and the max range 
+of the possible number of clusters of bars.}
+
+\item{...}{other parameters passed to \code{barplot} (except horiz, border, cex.names, names.arg, and las).}
+
+\item{left_margin}{(base R barplot) adjusts the left margin size to fit feature names.
+When it is NULL, the existing \code{par('mar')} is used.}
+
+\item{cex}{(base R barplot) passed as \code{cex.names} parameter to \code{barplot}.}
+
+\item{plot}{(base R barplot) whether a barplot should be produced. 
+If FALSE, only a data.table is returned.}
 }
 \value{
-A \code{ggplot2} bar graph representing each feature by a horizontal bar. Longer is the bar, more important is the feature. Features are classified by importance and clustered by importance. The group is represented through the color of the bar.
+The \code{xgb.plot.importance} function creates a \code{barplot} (when \code{plot=TRUE})
+and silently returns a processed data.table with \code{top_n} features sorted by importance.
+
+The \code{xgb.ggplot.importance} function returns a ggplot graph which could be customized afterwards.
+E.g., to change the title of the graph, add \code{+ ggtitle("A GRAPH NAME")} to the result.
 }
 \description{
-Read a data.table containing feature importance details and plot it (for both GLM and Trees).
+Represents previously calculated feature importance as a bar graph.
+\code{xgb.plot.importance} uses base R graphics, while \code{xgb.ggplot.importance} uses the ggplot backend.
 }
 \details{
-The purpose of this function is to easily represent the importance of each feature of a model.
-The function returns a ggplot graph, therefore each of its characteristic can be overriden (to customize it).
-In particular you may want to override the title of the graph. To do so, add \code{+ ggtitle("A GRAPH NAME")} next to the value returned by this function.
+The graph represents each feature as a horizontal bar of length proportional to the importance of a feature.
+Features are shown ranked in decreasing order of importance.
+It works for importances from both \code{gblinear} and \code{gbtree} models.
+
+When \code{rel_to_first = FALSE}, the values are plotted as they appear in \code{importance_matrix}.
+For a gbtree model, that means they are normalized to a total of 1
+("what is the feature's importance contribution relative to the whole model?").
+For linear models, \code{rel_to_first = FALSE} shows the actual values of the coefficients.
+Setting \code{rel_to_first = TRUE} shows the picture from the perspective of
+"what is the feature's importance contribution relative to the most important feature?"
+
+The ggplot-backend method also performs 1-D clustering of the importance values,
+with bar colors corresponding to different clusters that have somewhat similar importance values.
 }
 \examples{
-data(agaricus.train, package='xgboost')
+data(agaricus.train)

-#Both dataset are list with two items, a sparse matrix and labels
-#(labels = outcome column which will be learned).
-#Each column of the sparse Matrix is a feature in one hot encoding format.
-
-bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2,
+bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 3,
               eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")

 importance_matrix <- xgb.importance(colnames(agaricus.train$data), model = bst)
-xgb.plot.importance(importance_matrix)
+
+xgb.plot.importance(importance_matrix, rel_to_first = TRUE, xlab = "Relative importance")
+
+(gg <- xgb.ggplot.importance(importance_matrix, measure = "Frequency", rel_to_first = TRUE))
+gg + ggplot2::ylab("Frequency")

 }
+\seealso{
+\code{\link[graphics]{barplot}}.
+}