Fix spelling in documents (#6948)

* Update roxygen2 doc.

Co-authored-by: fis <jm.yuan@outlook.com>
Andrew Ziem
2021-05-11 06:44:36 -06:00
committed by GitHub
parent 2a9979e256
commit 3e7e426b36
100 changed files with 284 additions and 284 deletions

View File

@@ -5,9 +5,9 @@ Understand your dataset with XGBoost
Introduction
------------
The purpose of this Vignette is to show you how to use **Xgboost** to discover and understand your own dataset better.
The purpose of this Vignette is to show you how to use **XGBoost** to discover and understand your own dataset better.
This Vignette is not about predicting anything (see [Xgboost presentation](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)). We will explain how to use **Xgboost** to highlight the *link* between the *features* of your data and the *outcome*.
This Vignette is not about predicting anything (see [XGBoost presentation](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)). We will explain how to use **XGBoost** to highlight the *link* between the *features* of your data and the *outcome*.
Package loading:
@@ -27,7 +27,7 @@ Preparation of the dataset
### Numeric VS categorical variables
**Xgboost** manages only `numeric` vectors.
**XGBoost** manages only `numeric` vectors.
What to do when you have *categorical* data?
@@ -55,7 +55,7 @@ data(Arthritis)
df <- data.table(Arthritis, keep.rownames = FALSE)
```
> `data.table` is 100% compliant with **R** `data.frame` but its syntax is more consistent and its performance for large dataset is [best in class](http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly) (`dplyr` from **R** and `Pandas` from **Python** [included](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping)). Some parts of **Xgboost** **R** package use `data.table`.
> `data.table` is 100% compliant with **R** `data.frame` but its syntax is more consistent and its performance for large dataset is [best in class](http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly) (`dplyr` from **R** and `Pandas` from **Python** [included](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping)). Some parts of **XGBoost** **R** package use `data.table`.
The first thing we want to do is to have a look at the first lines of the `data.table`:
@@ -217,7 +217,7 @@ output_vector = df[,Improved] == "Marked"
Build the model
---------------
The code below is very usual. For more information, you can look at the documentation of `xgboost` function (or at the vignette [Xgboost presentation](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)).
The code below is very usual. For more information, you can look at the documentation of `xgboost` function (or at the vignette [XGBoost presentation](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)).
```r
@@ -422,19 +422,19 @@ Linear models may not be that smart in this scenario.
Special Note: What about Random Forests™?
-----------------------------------------
As you may know, the [Random Forests](http://en.wikipedia.org/wiki/Random_forest) algorithm is a cousin of boosting; both are part of the [ensemble learning](http://en.wikipedia.org/wiki/Ensemble_learning) family.
Both train several decision trees for one dataset. The *main* difference is that in Random Forests, trees are independent, while in boosting, the tree `N+1` focuses its learning on the loss (<=> what has not been well modeled by the tree `N`).
This difference has an impact on a corner case in feature importance analysis: the *correlated features*.
Imagine two perfectly correlated features, feature `A` and feature `B`. For one specific tree, if the algorithm needs one of them, it will choose randomly (true in both boosting and Random Forests).
However, in Random Forests this random choice will be made for each tree, because each tree is independent of the others. Therefore, approximately, depending on your parameters, 50% of the trees will choose feature `A` and the other 50% will choose feature `B`. So the *importance* of the information contained in `A` and `B` (which is the same, because they are perfectly correlated) is diluted in `A` and `B`. So you won't easily know this information is important for predicting what you want to predict! It is even worse when you have 10 correlated features...
In boosting, when a specific link between feature and outcome has been learned by the algorithm, it will try not to refocus on it (in theory that is what happens; reality is not always that simple). Therefore, all the importance will be on feature `A` or on feature `B` (but not both). You will know that one feature has an important role in the link between the observations and the label. It is still up to you to search for the features correlated to the one detected as important if you need to know all of them.
If you want to try Random Forests algorithm, you can tweak Xgboost parameters!
If you want to try Random Forests algorithm, you can tweak XGBoost parameters!
**Warning**: this is still an experimental parameter.
@@ -447,7 +447,7 @@ data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
#Random Forest - 1000 trees
bst <- xgboost(data = train$data, label = train$label, max.depth = 4, num_parallel_tree = 1000, subsample = 0.5, colsample_bytree =0.5, nrounds = 1, objective = "binary:logistic")
```
@@ -468,4 +468,4 @@ bst <- xgboost(data = train$data, label = train$label, max.depth = 4, nrounds =
> Note that the parameter `round` is set to `1`.
> [**Random Forests**](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_papers.htm) is a trademark of Leo Breiman and Adele Cutler and is licensed exclusively to Salford Systems for the commercial release of the software.

View File

@@ -5,9 +5,9 @@ XGBoost R Tutorial
## Introduction
**Xgboost** is short for e**X**treme **G**radient **Boost**ing package.
**XGBoost** is short for e**X**treme **G**radient **Boost**ing package.
The purpose of this Vignette is to show you how to use **Xgboost** to build a model and make predictions.
The purpose of this Vignette is to show you how to use **XGBoost** to build a model and make predictions.
It is an efficient and scalable implementation of the gradient boosting framework by @friedman2000additive and @friedman2001greedy. Two solvers are included:
@@ -32,10 +32,10 @@ It has several features:
## Installation
### Github version
### GitHub version
For weekly updated version (highly recommended), install from *Github*:
For weekly updated version (highly recommended), install from *GitHub*:
```r
@@ -177,7 +177,7 @@ We will train decision tree model using the following parameters:
* `objective = "binary:logistic"`: we will train a binary classification model ;
* `max.depth = 2`: the trees won't be deep, because our case is very simple ;
* `nthread = 2`: the number of cpu threads we are going to use;
* `nthread = 2`: the number of CPU threads we are going to use;
* `nrounds = 2`: there will be two passes on the data, the second one will enhance the model by further reducing the difference between ground truth and prediction.

View File

@@ -23,7 +23,7 @@ C++ Coding Guideline
***********************
Python Coding Guideline
***********************
- Follow `PEP 8: Style Guide for Python Code <https://www.python.org/dev/peps/pep-0008/>`_. We use PyLint to automatically enforce PEP 8 style across our Python codebase. Before submitting your pull request, you are encouraged to run PyLint on your machine. See :ref:`running_checks_locally`.
- Follow `PEP 8: Style Guide for Python Code <https://www.python.org/dev/peps/pep-0008/>`_. We use Pylint to automatically enforce PEP 8 style across our Python codebase. Before submitting your pull request, you are encouraged to run Pylint on your machine. See :ref:`running_checks_locally`.
- Docstrings should be in `NumPy docstring format <https://numpydoc.readthedocs.io/en/latest/format.html>`_.
.. _running_checks_locally:

View File

@@ -25,6 +25,6 @@ inside the ``doc/`` directory.
Examples
********
* Use cases and examples will be in `demo <https://github.com/dmlc/xgboost/tree/master/demo>`_.
* We are super excited to hear about your story, if you have blogposts,
tutorials code solutions using XGBoost, please tell us and we will add
* We are super excited to hear about your story. If you have blog posts,
tutorials, or code solutions using XGBoost, please tell us, and we will add
a link in the example pages.

View File

@@ -15,7 +15,7 @@ A robust and efficient **continuous integration (CI)** infrastructure is one of
There are several CI services available free to open source projects, such as Travis CI and AppVeyor. The XGBoost project already utilizes Travis and AppVeyor. However, the XGBoost project has needs that these free services do not adequately address. In particular, the limited usage quota of resources such as CPU and memory leaves XGBoost developers unable to run resource-intensive ("too-intensive") tests. In addition, they do not offer test machines with GPUs for testing the XGBoost GPU code base, which has been attracting more and more interest across many organizations. Consequently, the XGBoost project self-hosts a cloud server with Jenkins software installed: https://xgboost-ci.net/.
The self-hosted Jenkins CI server has recurring operating expenses. It utilizes a leading cloud provider (AWS) to accommodate variable workload. The master node serving the web interface is available 24/7, to accomodate contributions from people around the globe. In addition, the master node launches slave nodes on demand, to run the test suite on incoming contributions. To save cost, the slave nodes are terminated when they are no longer needed.
The self-hosted Jenkins CI server has recurring operating expenses. It utilizes a leading cloud provider (AWS) to accommodate variable workload. The master node serving the web interface is available 24/7, to accommodate contributions from people around the globe. In addition, the master node launches slave nodes on demand, to run the test suite on incoming contributions. To save cost, the slave nodes are terminated when they are no longer needed.
To help defray the hosting cost, the XGBoost project seeks donations from third parties.
@@ -25,7 +25,7 @@ Donors may choose to make one-time donations or recurring donations on monthly o
Fiscal host: Open Source Collective 501(c)(6)
---------------------------------------------
The Project Management Committee (PMC) of the XGBoost project appointed `Open Source Collective <https://opencollective.com/opensource>`_ as their **fiscal host**. The platform is a 501(c)(6) registered entity and will manage the funds on the behalf of the PMC so that PMC members will not have to manage the funds directly. The platform currently hosts several well-known Javascript frameworks such as Babel, Vue, and Webpack.
The Project Management Committee (PMC) of the XGBoost project appointed `Open Source Collective <https://opencollective.com/opensource>`_ as their **fiscal host**. The platform is a 501(c)(6) registered entity and will manage the funds on the behalf of the PMC so that PMC members will not have to manage the funds directly. The platform currently hosts several well-known JavaScript frameworks such as Babel, Vue, and Webpack.
All expenses incurred for hosting CI will be submitted to the fiscal host with receipts. Only the expenses in the following categories will be approved for reimbursement:

View File

@@ -8,7 +8,7 @@ Versioning Policy
Starting from XGBoost 1.0.0, each XGBoost release will be versioned as [MAJOR].[FEATURE].[MAINTENANCE]
* MAJOR: We gurantee the API compatibility across releases with the same major version number. We expect to have a 1+ years development period for a new MAJOR release version.
* MAJOR: We guarantee the API compatibility across releases with the same major version number. We expect to have a 1+ years development period for a new MAJOR release version.
* FEATURE: We ship new features, improvements and bug fixes through feature releases. The cycle length of a feature release is decided by the size of the feature roadmap. The roadmap is decided right after the previous release.
* MAINTENANCE: Maintenance versions only contain bug fixes. This type of release only occurs when we find significant correctness and/or performance bugs that are a barrier for users to upgrade to a new version of XGBoost smoothly.
@@ -21,10 +21,10 @@ Making a Release
1. Modify ``CMakeLists.txt`` source tree, run CMake.
2. Modify ``DESCRIPTION`` in R-package.
3. Run ``change_version.sh`` in ``jvm-packages/dev``
3. Commit the change, create a PR on github on release branch. Port the bumped version to default branch, optionally with the postfix ``SNAPSHOT``.
4. Create a tag on release branch, either on github or locally.
5. Make a release on github tag page, which might be done with previous step if the tag is created on github.
6. Submit pip, cran and maven packages.
- pip package is maintained by [Hyunsu Cho](http://hyunsu-cho.io/) and [Jiaming Yuan](https://github.com/trivialfis). There's a helper script for downloading pre-built wheels on ``xgboost/dev/release-pypi.py`` along with simple instructions for using ``twine``.
- cran package is maintained by [Tong He](https://github.com/hetong007).
- maven packageis maintained by [Nan Zhu](https://github.com/CodingCat).
3. Commit the change, create a PR on GitHub on release branch. Port the bumped version to default branch, optionally with the postfix ``SNAPSHOT``.
4. Create a tag on release branch, either on GitHub or locally.
5. Make a release on GitHub tag page, which might be done with previous step if the tag is created on GitHub.
6. Submit pip, CRAN, and Maven packages.
- The pip package is maintained by [Hyunsu Cho](http://hyunsu-cho.io/) and [Jiaming Yuan](https://github.com/trivialfis). There's a helper script for downloading pre-built wheels on ``xgboost/dev/release-pypi.py`` along with simple instructions for using ``twine``.
- The CRAN package is maintained by [Tong He](https://github.com/hetong007).
- The Maven package is maintained by [Nan Zhu](https://github.com/CodingCat).

View File

@@ -212,7 +212,7 @@ You can run benchmarks on synthetic data for binary classification:
python tests/benchmark/benchmark_tree.py --tree_method=gpu_hist
python tests/benchmark/benchmark_tree.py --tree_method=hist
Training time on 1,000,000 rows x 50 columns of random data with 500 boosting iterations and 0.25/0.75 test/train split with AMD Ryzen 7 2700 8 core @3.20GHz and Nvidia 1080ti yields the following results:
Training time on 1,000,000 rows x 50 columns of random data with 500 boosting iterations and 0.25/0.75 test/train split with AMD Ryzen 7 2700 8 core @3.20GHz and NVIDIA 1080ti yields the following results:
+--------------+----------+
| tree_method | Time (s) |
@@ -242,14 +242,14 @@ If you are getting out-of-memory errors on a big dataset, try the `external memo
Developer notes
===============
The application may be profiled with annotations by specifying USE_NTVX to cmake and providing the path to the stand-alone nvtx header via NVTX_HEADER_DIR. Regions covered by the 'Monitor' class in cuda code will automatically appear in the nsight profiler.
The application may be profiled with annotations by specifying USE_NVTX to CMake and providing the path to the stand-alone NVTX header via NVTX_HEADER_DIR. Regions covered by the 'Monitor' class in CUDA code will automatically appear in the Nsight profiler.
**********
References
**********
`Mitchell R, Frank E. (2017) Accelerating the XGBoost algorithm using GPU computing. PeerJ Computer Science 3:e127 https://doi.org/10.7717/peerj-cs.127 <https://peerj.com/articles/cs-127/>`_
`Nvidia Parallel Forall: Gradient Boosting, Decision Trees and XGBoost with CUDA <https://devblogs.nvidia.com/parallelforall/gradient-boosting-decision-trees-xgboost-cuda/>`_
`NVIDIA Parallel Forall: Gradient Boosting, Decision Trees and XGBoost with CUDA <https://devblogs.nvidia.com/parallelforall/gradient-boosting-decision-trees-xgboost-cuda/>`_
`Out-of-Core GPU Gradient Boosting <https://arxiv.org/abs/2005.09148>`_

View File

@@ -58,7 +58,7 @@ In this section, we use `Iris <https://archive.ics.uci.edu/ml/datasets/iris>`_ d
showcase how we use Spark to transform a raw dataset and make it fit the data interface of XGBoost.
The Iris dataset is shipped in CSV format. Each instance contains 4 features, "sepal length", "sepal width",
"petal length" and "petal width". In addition, it contains the "class" columnm, which is essentially the label with three possible values: "Iris Setosa", "Iris Versicolour" and "Iris Virginica".
"petal length" and "petal width". In addition, it contains the "class" column, which is essentially the label with three possible values: "Iris Setosa", "Iris Versicolour" and "Iris Virginica".
Read Dataset with Spark's Built-In Reader
-----------------------------------------
@@ -562,7 +562,7 @@ Checkpoint During Training
Transient failures are also commonly seen in production environments. To simplify the design of XGBoost,
we stop training if any of the distributed workers fail. However, if the training fails after having run for a long time, it would be a great waste of resources.
We support creating checkpoint during training to facilitate more efficient recovery from failture. To enable this feature, you can set how many iterations we build each checkpoint with ``setCheckpointInterval`` and the location of checkpoints with ``setCheckpointPath``:
We support creating checkpoint during training to facilitate more efficient recovery from failure. To enable this feature, you can set how many iterations we build each checkpoint with ``setCheckpointInterval`` and the location of checkpoints with ``setCheckpointPath``:
.. code-block:: scala

View File

@@ -191,7 +191,7 @@ Parameters for Tree Booster
- Choices: ``default``, ``update``
- ``default``: The normal boosting process which creates new trees.
- ``update``: Starts from an existing model and only updates its trees. In each boosting iteration, a tree from the initial model is taken, a specified sequence of updaters is run for that tree, and a modified tree is added to the new model. The new model would have either the same or smaller number of trees, depending on the number of boosting iteratons performed. Currently, the following built-in updaters could be meaningfully used with this process type: ``refresh``, ``prune``. With ``process_type=update``, one cannot use updaters that create new trees.
- ``update``: Starts from an existing model and only updates its trees. In each boosting iteration, a tree from the initial model is taken, a specified sequence of updaters is run for that tree, and a modified tree is added to the new model. The new model would have either the same or smaller number of trees, depending on the number of boosting iterations performed. Currently, the following built-in updaters could be meaningfully used with this process type: ``refresh``, ``prune``. With ``process_type=update``, one cannot use updaters that create new trees.
* ``grow_policy`` [default= ``depthwise``]
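Going back to ``process_type``: the following is a minimal sketch, not taken from the original documentation, of refreshing an existing model on new data without growing additional trees. The file names and parameter values are hypothetical.

.. code-block:: python

    import xgboost as xgb

    # Hypothetical new data; any existing DMatrix would do.
    dtrain = xgb.DMatrix("new_data.libsvm?format=libsvm")

    params = {
        "process_type": "update",  # start from an existing model, do not grow new trees
        "updater": "refresh",      # re-run the refresh updater on each existing tree
        "refresh_leaf": True,      # update leaf values as well as internal node statistics
    }

    # xgb_model points to a previously trained model; num_boost_round should not
    # exceed the number of boosting iterations stored in that model.
    updated = xgb.train(params, dtrain, num_boost_round=10, xgb_model="model.json")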
@@ -362,15 +362,15 @@ Specify the learning task and the corresponding learning objective. The objectiv
- ``binary:logistic``: logistic regression for binary classification, output probability
- ``binary:logitraw``: logistic regression for binary classification, output score before logistic transformation
- ``binary:hinge``: hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.
- ``count:poisson`` --poisson regression for count data, output mean of poisson distribution
- ``count:poisson`` --poisson regression for count data, output mean of Poisson distribution
- ``max_delta_step`` is set to 0.7 by default in poisson regression (used to safeguard optimization)
- ``max_delta_step`` is set to 0.7 by default in Poisson regression (used to safeguard optimization)
- ``survival:cox``: Cox regression for right censored survival time data (negative values are considered right censored).
Note that predictions are returned on the hazard ratio scale (i.e., as HR = exp(marginal_prediction) in the proportional hazard function ``h(t) = h0(t) * HR``).
- ``survival:aft``: Accelerated failure time model for censored survival time data.
See :doc:`/tutorials/aft_survival_analysis` for details.
- ``aft_loss_distribution``: Probabilty Density Function used by ``survival:aft`` objective and ``aft-nloglik`` metric.
- ``aft_loss_distribution``: Probability Density Function used by ``survival:aft`` objective and ``aft-nloglik`` metric.
- ``multi:softmax``: set XGBoost to do multiclass classification using the softmax objective. You also need to set ``num_class`` (number of classes).
- ``multi:softprob``: same as softmax, but output a vector of ``ndata * nclass``, which can be further reshaped to ``ndata * nclass`` matrix. The result contains predicted probability of each data point belonging to each class.
- ``rank:pairwise``: Use LambdaMART to perform pairwise ranking where the pairwise loss is minimized
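As an illustration only (parameter values are arbitrary and ``dtrain`` is an assumed existing ``DMatrix``), configuring one of these objectives amounts to passing it in the parameter dictionary:

.. code-block:: python

    import xgboost as xgb

    # Multi-class classification: softprob yields one probability per (row, class).
    params_multi = {"objective": "multi:softprob", "num_class": 3, "eval_metric": "mlogloss"}

    # Count data: Poisson regression, with the documented max_delta_step safeguard.
    params_poisson = {"objective": "count:poisson", "max_delta_step": 0.7}

    bst = xgb.train(params_multi, dtrain, num_boost_round=10)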

View File

@@ -4,7 +4,7 @@ Callback Functions
This document gives a basic walkthrough of the callback functions used in the XGBoost Python
package. In XGBoost 1.3, a new callback interface was designed for the Python package, which
provides the flexiblity of designing various extension for training. Also, XGBoost has a
provides the flexibility of designing various extensions for training. Also, XGBoost has a
number of pre-defined callbacks for supporting early stopping, checkpoints etc.
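For example, a minimal sketch of the new interface using the built-in early stopping callback; ``dtrain`` and ``dvalid`` are assumed to be existing ``DMatrix`` objects:

.. code-block:: python

    import xgboost as xgb

    early_stop = xgb.callback.EarlyStopping(rounds=10, save_best=True)
    booster = xgb.train(
        {"objective": "binary:logistic", "eval_metric": "logloss"},
        dtrain,
        num_boost_round=1000,
        evals=[(dvalid, "validation")],
        callbacks=[early_stop],  # stop if the validation metric does not improve for 10 rounds
    )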

View File

@@ -1,6 +1,6 @@
Python API Reference
====================
This page gives the Python API reference of xgboost, please also refer to Python Package Introduction for more information about python package.
This page gives the Python API reference of xgboost; please also refer to Python Package Introduction for more information about the Python package.
.. contents::
:backlinks: none

View File

@@ -1,7 +1,7 @@
###########################
Python Package Introduction
###########################
This document gives a basic walkthrough of xgboost python package.
This document gives a basic walkthrough of the xgboost package for Python.
**List of other Helpful Links**
@@ -24,7 +24,7 @@ Data Interface
--------------
The XGBoost python module is able to load data from:
- LibSVM text format file
- LIBSVM text format file
- Comma-separated values (CSV) file
- NumPy 2D array
- SciPy 2D sparse array
@@ -36,7 +36,7 @@ The XGBoost python module is able to load data from:
The data is stored in a :py:class:`DMatrix <xgboost.DMatrix>` object.
* To load a libsvm text file or a XGBoost binary file into :py:class:`DMatrix <xgboost.DMatrix>`:
* To load a LIBSVM text file or a XGBoost binary file into :py:class:`DMatrix <xgboost.DMatrix>`:
.. code-block:: python
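    # A sketch, not the original snippet from this page: constructing DMatrix
    # objects from a few of the supported sources. File paths are hypothetical.
    import numpy as np
    import scipy.sparse
    import xgboost as xgb

    # LIBSVM text file
    dtrain = xgb.DMatrix("agaricus.txt.train")

    # NumPy 2D array with labels
    X = np.random.rand(100, 10)
    y = np.random.randint(2, size=100)
    dmat = xgb.DMatrix(X, label=y)

    # SciPy CSR sparse matrix
    dsparse = xgb.DMatrix(scipy.sparse.csr_matrix(X))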

View File

@@ -80,7 +80,7 @@ Other Updaters
1. ``Pruner``: It prunes the built tree by the ``gamma`` parameter. ``pruner`` is usually
used as part of other tree methods.
2. ``Refresh``: Refresh the statistic of bulilt trees on a new training dataset.
2. ``Refresh``: Refresh the statistic of built trees on a new training dataset.
3. ``Sync``: Synchronize the tree among workers when running distributed training.
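For illustration, here is a hedged sketch of selecting an explicit updater sequence instead of a ``tree_method`` alias; the particular composition below (approx-style growing followed by the ``Pruner``) and the parameter values are assumptions for demonstration, and ``dtrain`` is an assumed existing ``DMatrix``.

.. code-block:: python

    import xgboost as xgb

    params = {
        "updater": "grow_histmaker,prune",  # grow trees, then run the Pruner
        "gamma": 1.0,                       # minimum loss reduction used when pruning
    }
    bst = xgb.train(params, dtrain, num_boost_round=10)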
****************
@@ -90,12 +90,12 @@ Removed Updaters
2 updaters were removed during development due to maintainability concerns. We describe them here
solely for the sake of documentation. The first one is the distributed colmaker, which was a
distributed version of the exact tree method. It required specialization for a column-based
spliting strategy and a different prediction procedure. As the exact tree method is slow
splitting strategy and a different prediction procedure. As the exact tree method is slow
by itself and scaling is even less efficient, we removed it entirely. The second one is
``skmaker``. Per-node weighted sketching employed by ``grow_local_histmaker`` is slow,
and ``skmaker`` was unmaintained; it seems to have been a workaround trying to eliminate the
histogram creation step and use sketching values directly during split evaluation. It
was never tested and contained some unknown bugs, so we decided to remove it and focus our
resources on more promising algorithms instead. In terms of accuracy, most of the time
``approx``, ``hist`` and ``gpu_hist`` are enough with some parameters tunning, so removing
``approx``, ``hist`` and ``gpu_hist`` are enough with some parameters tuning, so removing
them doesn't have any real practical impact.

View File

@@ -158,7 +158,7 @@ The parameter ``aft_loss_distribution`` corresponds to the distribution of the :
Currently, you can choose from three probability distributions for ``aft_loss_distribution``:
========================= ===========================================
``aft_loss_distribution`` Probabilty Density Function (PDF)
``aft_loss_distribution`` Probability Density Function (PDF)
========================= ===========================================
``normal`` :math:`\dfrac{\exp{(-z^2/2)}}{\sqrt{2\pi}}`
``logistic`` :math:`\dfrac{e^z}{(1+e^z)^2}`
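A minimal sketch of putting these pieces together; the interval-censored labels are synthetic and the parameter values are illustrative only:

.. code-block:: python

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(100, 5)
    y_lower = np.random.rand(100) * 10        # lower bound of the survival time
    y_upper = y_lower + np.random.rand(100)   # finite upper bound => interval censored

    dtrain = xgb.DMatrix(X)
    dtrain.set_float_info("label_lower_bound", y_lower)
    dtrain.set_float_info("label_upper_bound", y_upper)

    params = {
        "objective": "survival:aft",
        "eval_metric": "aft-nloglik",
        "aft_loss_distribution": "normal",    # one of the PDFs listed above
        "aft_loss_distribution_scale": 1.20,
    }
    bst = xgb.train(params, dtrain, num_boost_round=50)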

View File

@@ -2,7 +2,7 @@
C API Tutorial
##############################
In this tutorial, we are going to install XGBoost library & configure the CMakeLists.txt file of our C/C++ application to link XGBoost library with our application. Later on, we will see some usefull tips for using C API and code snippets as examples to use various functions available in C API to perform basic task like loading, training model & predicting on test dataset.
In this tutorial, we are going to install XGBoost library & configure the CMakeLists.txt file of our C/C++ application to link XGBoost library with our application. Later on, we will see some useful tips for using C API and code snippets as examples to use various functions available in C API to perform basic task like loading, training model & predicting on test dataset.
.. contents::
:backlinks: none
@@ -68,11 +68,11 @@ To ensure that CMake can locate the XGBoost library, supply ``-DCMAKE_PREFIX_PAT
Useful Tips To Remember
************************
Below are some usefull tips while using C API:
Below are some useful tips while using C API:
1. Error handling: Always check the return value of the C API functions.
a. In a C application: Use the following macro to guard all calls to XGBoost's C API functions. The macro prints all the error/ exception occured:
a. In a C application: Use the following macro to guard all calls to XGBoost's C API functions. The macro prints all the error/ exception occurred:
.. highlight:: c
:linenothreshold: 5

View File

@@ -143,6 +143,6 @@ For fully reproducible source code and comparison plots, see `custom_rmsle.py <h
Multi-class objective function
******************************
A similiar demo for multi-class objective funtion is also available, see
A similar demo for multi-class objective function is also available, see
`demo/guide-python/custom_softmax.py <https://github.com/dmlc/xgboost/tree/master/demo/guide-python/custom_softmax.py>`_
for details.

View File

@@ -127,7 +127,7 @@ In previous example we used ``DaskDMatrix`` as input to ``predict`` function. I
practice, it's also possible to call ``predict`` directly on dask collections
like ``Array`` and ``DataFrame``, which might have better prediction performance. When
``DataFrame`` is used as the prediction input, the result is a dask ``Series`` instead of an
array. Also, there's inplace predict support on dask interface, which can help reducing
array. Also, there's in-place predict support on dask interface, which can help reducing
both memory usage and prediction time.
.. code-block:: python
@@ -479,7 +479,7 @@ Here are some pratices on reducing memory usage with dask and xgboost.
  ``xgboost.dask.DaskDeviceQuantileDMatrix`` as a drop-in replacement for ``DaskDMatrix``
to reduce overall memory usage. See ``demo/dask/gpu_training.py`` for an example.
- Use inplace prediction when possible.
- Use in-place prediction when possible.
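A hedged sketch of the two prediction paths mentioned above; ``client`` is an assumed existing ``dask.distributed.Client`` and ``output`` is the dictionary returned by ``xgboost.dask.train``:

.. code-block:: python

    from dask import array as da
    import xgboost as xgb

    booster = output["booster"]
    X = da.random.random((1000, 10), chunks=(100, 10))

    # Predict directly on a dask collection; the result is a dask array.
    pred = xgb.dask.predict(client, booster, X)

    # In-place prediction skips building a DaskDMatrix, saving memory.
    pred_inplace = xgb.dask.inplace_predict(client, booster, X)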
References:

View File

@@ -10,7 +10,7 @@ The external memory version takes in the following `URI <https://en.wikipedia.or
filename#cacheprefix
The ``filename`` is the normal path to libsvm format file you want to load in, and
The ``filename`` is the normal path to LIBSVM format file you want to load in, and
``cacheprefix`` is a path to a cache file that XGBoost will use for caching preprocessed
data in binary form.
@@ -24,7 +24,7 @@ where ``label_column`` should point to the csv column acting as the label.
To provide a simple example for illustration, we extract the code from
`demo/guide-python/external_memory.py <https://github.com/dmlc/xgboost/blob/master/demo/guide-python/external_memory.py>`_. If
you have a dataset stored in a file similar to ``agaricus.txt.train`` with libSVM format, the external memory support can be enabled by:
you have a dataset stored in a file similar to ``agaricus.txt.train`` with LIBSVM format, the external memory support can be enabled by:
.. code-block:: python
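    # A sketch, not the original snippet from this page; the '#dtrain.cache' suffix
    # asks XGBoost to cache the parsed data on disk under that prefix.
    import xgboost as xgb

    dtrain = xgb.DMatrix("agaricus.txt.train#dtrain.cache")
    params = {"objective": "binary:logistic", "tree_method": "approx"}
    bst = xgb.train(params, dtrain, num_boost_round=10)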

View File

@@ -5,7 +5,7 @@ Text Input Format of DMatrix
******************
Basic Input Format
******************
XGBoost currently supports two text formats for ingesting data: LibSVM and CSV. The rest of this document will describe the LibSVM format. (See `this Wikipedia article <https://en.wikipedia.org/wiki/Comma-separated_values>`_ for a description of the CSV format.). Please be careful that, XGBoost does **not** understand file extensions, nor try to guess the file format, as there is no universal agreement upon file extension of LibSVM or CSV. Instead it employs `URI <https://en.wikipedia.org/wiki/Uniform_Resource_Identifier>`_ format for specifying the precise input file type. For example if you provide a `csv` file ``./data.train.csv`` as input, XGBoost will blindly use the default libsvm parser to digest it and generate a parser error. Instead, users need to provide an uri in the form of ``train.csv?format=csv``. For external memory input, the uri should of a form similar to ``train.csv?format=csv#dtrain.cache``. See :ref:`python_data_interface` and :doc:`/tutorials/external_memory` also.
XGBoost currently supports two text formats for ingesting data: LIBSVM and CSV. The rest of this document will describe the LIBSVM format. (See `this Wikipedia article <https://en.wikipedia.org/wiki/Comma-separated_values>`_ for a description of the CSV format.). Please be careful that, XGBoost does **not** understand file extensions, nor try to guess the file format, as there is no universal agreement upon file extension of LIBSVM or CSV. Instead it employs `URI <https://en.wikipedia.org/wiki/Uniform_Resource_Identifier>`_ format for specifying the precise input file type. For example if you provide a `csv` file ``./data.train.csv`` as input, XGBoost will blindly use the default LIBSVM parser to digest it and generate a parser error. Instead, users need to provide a URI in the form of ``train.csv?format=csv``. For external memory input, the URI should be of a form similar to ``train.csv?format=csv#dtrain.cache``. See :ref:`python_data_interface` and :doc:`/tutorials/external_memory` also.
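For instance, a hedged sketch of constructing ``DMatrix`` objects with explicit format hints; the paths are hypothetical:

.. code-block:: python

    import xgboost as xgb

    # CSV input: the format hint is mandatory, and label_column selects the label.
    dtrain = xgb.DMatrix("train.csv?format=csv&label_column=0")

    # External memory variant: append a cache prefix after '#'.
    dcache = xgb.DMatrix("train.csv?format=csv&label_column=0#dtrain.cache")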
For training or predicting, XGBoost takes an instance file with the format as below:
@@ -23,7 +23,7 @@ Each line represent a single instance, and in the first line '1' is the instance
******************************************
Auxiliary Files for Additional Information
******************************************
**Note: all information below is applicable only to single-node version of the package.** If you'd like to perform distributed training with multiple nodes, skip to the section `Embedding additional information inside LibSVM file`_.
**Note: all information below is applicable only to single-node version of the package.** If you'd like to perform distributed training with multiple nodes, skip to the section `Embedding additional information inside LIBSVM file`_.
Group Input Format
==================
@@ -72,13 +72,13 @@ XGBoost supports providing each instance an initial margin prediction. For examp
XGBoost will take these values as initial margin prediction and boost from that. An important note about base_margin is that it should be margin prediction before transformation, so if you are doing logistic loss, you will need to put in value before logistic transformation. If you are using XGBoost predictor, use ``pred_margin=1`` to output margin values.
***************************************************
Embedding additional information inside LibSVM file
Embedding additional information inside LIBSVM file
***************************************************
**This section is applicable to both single- and multiple-node settings.**
Query ID Columns
================
This is most useful for `ranking task <https://github.com/dmlc/xgboost/tree/master/demo/rank>`_, where the instances are grouped into query groups. You may embed query group ID for each instance in the LibSVM file by adding a token of form ``qid:xx`` in each row:
This is most useful for `ranking task <https://github.com/dmlc/xgboost/tree/master/demo/rank>`_, where the instances are grouped into query groups. You may embed query group ID for each instance in the LIBSVM file by adding a token of form ``qid:xx`` in each row:
.. code-block:: none
:caption: ``train.txt``
@@ -98,7 +98,7 @@ Keep in mind the following restrictions:
Instance weights
================
You may specify instance weights in the LibSVM file by appending each instance label with the corresponding weight in the form of ``[label]:[weight]``, as shown by the following example:
You may specify instance weights in the LIBSVM file by appending each instance label with the corresponding weight in the form of ``[label]:[weight]``, as shown by the following example:
.. code-block:: none
:caption: ``train.txt``

View File

@@ -1,9 +1,9 @@
#########################
Random Forests in XGBoost
Random Forests(TM) in XGBoost
#########################
XGBoost is normally used to train gradient-boosted decision trees and other gradient
boosted models. Random forests use the same model representation and inference, as
boosted models. Random Forests use the same model representation and inference, as
gradient-boosted decision trees, but a different training algorithm. One can use XGBoost
to train a standalone random forest, or use a random forest as a base model for gradient
boosting. Here we focus on training a standalone random forest.
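As a quick sketch of that standalone use (hyper-parameter values are illustrative only), the scikit-learn wrapper exposes a dedicated random forest estimator:

.. code-block:: python

    import xgboost as xgb
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # XGBRFClassifier grows all trees in a single boosting round, i.e. a random forest.
    rf = xgb.XGBRFClassifier(n_estimators=100, subsample=0.8, colsample_bynode=0.8)
    rf.fit(X, y)
    print(rf.predict_proba(X[:5]))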

View File

@@ -31,10 +31,10 @@ part of model, that's because objective controls transformation of global bias (
evaluation or continue the training with a different set of hyper-parameters etc.
However, this is not the end of the story. There are cases where we need to save something
more than just the model itself. For example, in distrbuted training, XGBoost performs
more than just the model itself. For example, in distributed training, XGBoost performs
checkpointing operations. Or for some reason, your favorite distributed computing
framework decides to copy the model from one worker to another and continue the training
there. In such cases, the serialisation output is required to contain enougth information
there. In such cases, the serialisation output is required to contain enough information
to continue the previous training without the user providing any parameters again. We consider
such a scenario as a **memory snapshot** (or memory-based serialisation method) and distinguish it
from normal model IO operations. Currently, memory snapshot is used in the following places:
@@ -145,7 +145,7 @@ or in R:
config <- xgb.config(bst)
print(config)
Will print out something similiar to (not actual output as it's too long for demonstration):
Will print out something similar to (not actual output as it's too long for demonstration):
.. code-block:: javascript