Fix spelling in documents (#6948)

* Update roxygen2 doc.

Co-authored-by: fis <jm.yuan@outlook.com>
Andrew Ziem
2021-05-11 06:44:36 -06:00
committed by GitHub
parent 2a9979e256
commit 3e7e426b36
100 changed files with 284 additions and 284 deletions

View File

@@ -1,5 +1,5 @@
---
title: "Understand your dataset with Xgboost"
title: "Understand your dataset with XGBoost"
output:
  rmarkdown::html_vignette:
    css: vignette.css
@@ -18,9 +18,9 @@ Understand your dataset with XGBoost
Introduction
------------
The purpose of this vignette is to show you how to use **Xgboost** to discover and understand your own dataset better.
The purpose of this vignette is to show you how to use **XGBoost** to discover and understand your own dataset better.
This vignette is not about predicting anything (see [Xgboost presentation](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)). We will explain how to use **Xgboost** to highlight the *link* between the *features* of your data and the *outcome*.
This vignette is not about predicting anything (see [XGBoost presentation](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)). We will explain how to use **XGBoost** to highlight the *link* between the *features* of your data and the *outcome*.
Package loading:
@@ -39,7 +39,7 @@ Preparation of the dataset
### Numeric vs. categorical variables
**Xgboost** manages only `numeric` vectors.
**XGBoost** only works with `numeric` vectors.
What to do when you have *categorical* data?
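The usual answer, previewed here on a made-up data frame, is to *one-hot encode* them into numeric dummy columns (a minimal sketch; the vignette applies the same idea to the real dataset below):

```r
library(Matrix)
# a toy data frame with one categorical column (hypothetical example)
d <- data.frame(treatment = factor(c("Placebo", "Treated", "Placebo")))
# one-hot encode the factor into numeric dummy columns, stored sparsely
sparse.model.matrix(~ treatment - 1, data = d)
```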
@@ -66,7 +66,7 @@ data(Arthritis)
df <- data.table(Arthritis, keep.rownames = FALSE)
```
> `data.table` is 100% compliant with **R** `data.frame` but its syntax is more consistent and its performance for large dataset is [best in class](https://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly) (`dplyr` from **R** and `Pandas` from **Python** [included](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping)). Some parts of **Xgboost** **R** package use `data.table`.
> `data.table` is 100% compliant with **R** `data.frame` but its syntax is more consistent and its performance for large datasets is [best in class](https://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly) (`dplyr` from **R** and `Pandas` from **Python** [included](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping)). Some parts of the **XGBoost** **R** package use `data.table`.
The first thing we want to do is to have a look at the first few lines of the `data.table`:
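For instance, a minimal sketch:

```r
# a quick look at the data; df is the Arthritis data.table built above
head(df)
str(df)
```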
@@ -166,7 +166,7 @@ output_vector = df[,Improved] == "Marked"
Build the model
---------------
The code below is very usual. For more information, you can look at the documentation of `xgboost` function (or at the vignette [Xgboost presentation](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)).
The code below is fairly standard. For more information, you can look at the documentation of the `xgboost` function (or at the vignette [XGBoost presentation](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)).
```{r}
bst <- xgboost(data = sparse_matrix, label = output_vector, max_depth = 4,
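               # continuation sketched in: the diff truncates this chunk,
               # so the eta/nthread/nrounds values below are assumed
               eta = 1, nthread = 2, nrounds = 10, objective = "binary:logistic")
```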
@@ -176,7 +176,7 @@ bst <- xgboost(data = sparse_matrix, label = output_vector, max_depth = 4,
You can see several `train-error: 0.XXXXX` lines, each reporting the training error after one more boosting round; the error decreases. Each line shows how well the model explains your data. Lower is better.
A model which fits too well may [overfit](https://en.wikipedia.org/wiki/Overfitting) (meaning it copy/paste too much the past, and won't be that good to predict the future).
A small value for training error may be a symptom of [overfitting](https://en.wikipedia.org/wiki/Overfitting), meaning the model will not accurately predict the future values.
> Here you can see the numbers decrease until line 7 and then increase.
>
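A common way to watch for this (a sketch, not part of the original vignette; it reuses `sparse_matrix` and `output_vector` from above and assumes no separate test set):

```r
# monitor the error each round and stop once it stops improving;
# with a real held-out set, put it in the watchlist instead of the training data
dtrain <- xgb.DMatrix(data = sparse_matrix, label = output_vector)
bst <- xgb.train(params = list(objective = "binary:logistic", max_depth = 4, eta = 1),
                 data = dtrain, nrounds = 10,
                 watchlist = list(train = dtrain), early_stopping_rounds = 3)
```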
@@ -304,19 +304,19 @@ Linear model may not be that smart in this scenario.
Special Note: What about Random Forests™?
-----------------------------------------
As you may know, the [Random Forests](https://en.wikipedia.org/wiki/Random_forest) algorithm is a cousin of boosting; both are part of the [ensemble learning](https://en.wikipedia.org/wiki/Ensemble_learning) family.
Both train several decision trees on one dataset. The *main* difference is that in Random Forests the trees are independent, while in boosting tree `N+1` focuses its learning on the loss, i.e. on what has not been well modeled by tree `N`.
This difference has an impact on a corner case of feature importance analysis: *correlated features*.
Imagine two perfectly correlated features, `A` and `B`. For one specific tree, if the algorithm needs one of them, it will choose randomly (this is true in both boosting and Random Forests).
However, in Random Forests this random choice is made for each tree, because the trees are independent of each other. Therefore, approximately (depending on your parameters), 50% of the trees will choose feature `A` and the other 50% will choose feature `B`, so the *importance* of the information contained in `A` and `B` (which is the same, because they are perfectly correlated) is diluted between them. As a result, you won't easily know that this information is important for predicting what you want to predict! It is even worse when you have 10 correlated features...
In boosting, once a specific link between a feature and the outcome has been learned, the algorithm tries not to refocus on it (in theory that is what happens; reality is not always that simple). Therefore, all the importance will be on feature `A` or on feature `B` (but not both), and you will know that one feature plays an important role in the link between the observations and the label. It is still up to you to search for the features correlated with the one detected as important, if you need to know all of them.
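You can check this directly on the model trained above (a sketch; `bst` and `sparse_matrix` come from the earlier chunks):

```r
# gain-based importance: how much each feature improves the splits it is used in
importance <- xgb.importance(feature_names = colnames(sparse_matrix), model = bst)
head(importance)
```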
If you want to try Random Forests algorithm, you can tweak Xgboost parameters!
If you want to try the Random Forests algorithm, you can tweak XGBoost parameters!
For instance, to compute a model with 1000 trees, with a 0.5 sampling factor on both rows and columns:
@@ -326,7 +326,7 @@ data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
# Random Forest - 1000 trees
bst <- xgboost(data = train$data, label = train$label, max_depth = 4, num_parallel_tree = 1000, subsample = 0.5, colsample_bytree = 0.5, nrounds = 1, objective = "binary:logistic")
# Boosting - 3 rounds
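# sketch of the truncated boosting call: 3 rounds per the comment above
# (the remaining argument values are assumed)
bst <- xgboost(data = train$data, label = train$label, max_depth = 4,
               nrounds = 3, objective = "binary:logistic")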
@@ -335,4 +335,4 @@ bst <- xgboost(data = train$data, label = train$label, max_depth = 4, nrounds =
> Note that the parameter `nrounds` is set to `1`.
> [**Random Forests**](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_papers.htm) is a trademark of Leo Breiman and Adele Cutler and is licensed exclusively to Salford Systems for the commercial release of the software.

View File

@@ -1,5 +1,5 @@
---
title: "Xgboost presentation"
title: "XGBoost presentation"
output:
  rmarkdown::html_vignette:
    css: vignette.css
@@ -8,7 +8,7 @@ output:
bibliography: xgboost.bib
author: Tianqi Chen, Tong He, Michaël Benesty
vignette: >
%\VignetteIndexEntry{Xgboost presentation}
%\VignetteIndexEntry{XGBoost presentation}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
---
@@ -19,9 +19,9 @@ XGBoost R Tutorial
## Introduction
**Xgboost** is short for e**X**treme **G**radient **Boost**ing package.
**XGBoost** is short for the e**X**treme **G**radient **Boost**ing package.
The purpose of this Vignette is to show you how to use **Xgboost** to build a model and make predictions.
The purpose of this Vignette is to show you how to use **XGBoost** to build a model and make predictions.
It is an efficient and scalable implementation of the gradient boosting framework by @friedman2000additive and @friedman2001greedy. Two solvers are included:
@@ -46,10 +46,10 @@ It has several features:
## Installation
### Github version
### GitHub version
For weekly updated version (highly recommended), install from *Github*:
For the weekly updated version (highly recommended), install from *GitHub*:
```{r installGithub, eval=FALSE}
install.packages("drat", repos="https://cran.rstudio.com")
@@ -82,7 +82,7 @@ require(xgboost)
### Dataset presentation
In this example, we are aiming to predict whether a mushroom can be eaten or not (like in many tutorials, example data are the the same as you will use on in your every day life :-).
In this example, we are aiming to predict whether a mushroom can be eaten or not (like in many tutorials, the example data are the same as you will use in your everyday life :-).
Mushroom data is cited from UCI Machine Learning Repository. @Bache+Lichman:2013.
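A copy ships with the package; a minimal sketch of loading it (the same calls appear later in the vignette):

```r
# train/test splits bundled with xgboost; features are a sparse dgCMatrix
data(agaricus.train, package = 'xgboost')
data(agaricus.test, package = 'xgboost')
dim(agaricus.train$data)      # observations x one-hot encoded features
table(agaricus.train$label)   # binary 0/1 labels
```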
@@ -148,7 +148,7 @@ We will train decision tree model using the following parameters:
* `objective = "binary:logistic"`: we will train a binary classification model;
* `max_depth = 2`: the trees won't be deep, because our case is very simple;
* `nthread = 2`: the number of cpu threads we are going to use;
* `nthread = 2`: the number of CPU threads we are going to use;
* `nrounds = 2`: there will be two passes on the data, the second one will enhance the model by further reducing the difference between ground truth and prediction.
```{r trainingSparse, message=F, warning=F}
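# sketch completing the truncated chunk, following the parameter list above
# (eta = 1 and the train$data / train$label names are assumed)
bstSparse <- xgboost(data = train$data, label = train$label, max_depth = 2,
                     eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
```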
@@ -180,7 +180,7 @@ bstDMatrix <- xgboost(data = dtrain, max_depth = 2, eta = 1, nthread = 2, nround
**XGBoost** has several features to help you view how the learning progresses internally. The purpose is to help you set the best parameters, which are the key to your model's quality.
One of the simplest way to see the training progress is to set the `verbose` option (see below for more advanced technics).
One of the simplest ways to see the training progress is to set the `verbose` option (see below for more advanced techniques).
```{r trainingVerbose0, message=T, warning=F}
# verbose = 0, no message
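# sketch completing the truncated chunk: the same training call, silenced
# (dtrain and the other argument values are assumed from the surrounding text)
bst <- xgboost(data = dtrain, max_depth = 2, eta = 1, nthread = 2,
               nrounds = 2, objective = "binary:logistic", verbose = 0)
```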
@@ -253,7 +253,7 @@ The most important thing to remember is that **to do a classification, you just
*Multiclass* classification works in a similar way.
This metric is **`r round(err, 2)`** and is pretty low: our yummly mushroom model works well!
This metric is **`r round(err, 2)`** and is pretty low: our yummy mushroom model works well!
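For reference, an error rate like this is typically computed as the share of misclassified test examples (a sketch; `pred` and `test` are assumed to come from the preceding prediction step):

```r
# classify with a 0.5 threshold, then average the disagreements with the labels
prediction <- as.numeric(pred > 0.5)
err <- mean(prediction != test$label)
print(paste("test-error =", err))
```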
## Advanced features

View File

@@ -16,7 +16,7 @@ XGBoost from JSON
## Introduction
The purpose of this Vignette is to show you how to correctly load and work with an **Xgboost** model that has been dumped to JSON. **Xgboost** internally converts all data to [32-bit floats](https://en.wikipedia.org/wiki/Single-precision_floating-point_format), and the values dumped to JSON are decimal representations of these values. When working with a model that has been parsed from a JSON file, care must be taken to correctly treat:
The purpose of this Vignette is to show you how to correctly load and work with an **XGBoost** model that has been dumped to JSON. **XGBoost** internally converts all data to [32-bit floats](https://en.wikipedia.org/wiki/Single-precision_floating-point_format), and the values dumped to JSON are decimal representations of these values. When working with a model that has been parsed from a JSON file, care must be taken to correctly treat:
- the input data, which should be converted to 32-bit floats
- any 32-bit floats that were stored in JSON as decimal representations
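The `fl()` calls used later in this vignette come from an R float library (the CRAN [float](https://cran.r-project.org/package=float) package is assumed here); a minimal sketch of the 32-bit behaviour:

```r
library(float)
# fl() rounds a double to 32-bit precision, mimicking what xgboost stores internally
x <- fl(0.1)
dbl(x) == 0.1   # FALSE: 0.1 has no exact 32-bit representation
```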
@@ -172,9 +172,9 @@ bst_from_json_preds <- ifelse(fl(data$dates)<fl(node$split_condition),
bst_preds == bst_from_json_preds
```
None are exactly equal again. What is going on here? Well, since we are using the value `1` in the calcuations, we have introduced a double into the calculation. Because of this, all float values are promoted to 64-bit doubles and the 64-bit version of the exponentiation operator `exp` is also used. On the other hand, xgboost uses the 32-bit version of the exponentation operator in its [sigmoid function](https://github.com/dmlc/xgboost/blob/54980b8959680a0da06a3fc0ec776e47c8cbb0a1/src/common/math.h#L25-L27).
None are exactly equal again. What is going on here? Well, since we are using the value `1` in the calculations, we have introduced a double into the calculation. Because of this, all float values are promoted to 64-bit doubles and the 64-bit version of the exponentiation operator `exp` is also used. On the other hand, xgboost uses the 32-bit version of the exponentiation operator in its [sigmoid function](https://github.com/dmlc/xgboost/blob/54980b8959680a0da06a3fc0ec776e47c8cbb0a1/src/common/math.h#L25-L27).
How do we fix this? We have to ensure we use the correct datatypes everywhere and the correct operators. If we use only floats, the float library that we have loaded will ensure the 32-bit float exponention operator is applied.
How do we fix this? We have to ensure we use the correct data types everywhere and the correct operators. If we use only floats, the float library that we have loaded will ensure the 32-bit float exponentiation operator is applied.
```{r}
# calculate the predictions casting doubles to floats
bst_from_json_preds <- ifelse(fl(data$dates)<fl(node$split_condition),