[doc] Document update [skip ci] (#8784)

- Remove version specifics in cat demo.
- Remove aws yarn.
- Update faq.
- Stop mentioning MPI.
- Update sphinx inventory links.
- Fix typo.
Jiaming Yuan 2023-02-12 04:25:22 +08:00 committed by GitHub
parent 8a16944664
commit e9c178f402
8 changed files with 19 additions and 31 deletions

View File

@@ -20,7 +20,7 @@
XGBoost is an optimized distributed gradient boosting library designed to be highly ***efficient***, ***flexible*** and ***portable***.
It implements machine learning algorithms under the [Gradient Boosting](https://en.wikipedia.org/wiki/Gradient_boosting) framework.
XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way.
The same code runs on major distributed environment (Kubernetes, Hadoop, SGE, MPI, Dask) and can solve problems beyond billions of examples.
The same code runs on major distributed environments (Kubernetes, Hadoop, SGE, Dask, Spark, PySpark) and can solve problems beyond billions of examples.
License
-------

View File

@@ -2,9 +2,7 @@
Getting started with categorical data
=====================================
Experimental support for categorical data. After 1.5 XGBoost `gpu_hist` tree method has
experimental support for one-hot encoding based tree split, and in 1.6 `approx` support
was added.
Experimental support for categorical data.
Previously, users needed to run an encoder themselves before passing the data into XGBoost,
which creates a sparse matrix and potentially increases memory usage. This demo
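For orientation, here is a rough sketch of what the demo works towards: a pandas ``category`` column is passed directly to ``DMatrix`` with ``enable_categorical=True`` instead of being one-hot encoded first. The toy data frame and column names are made up for illustration, and ``approx`` is used as one of the tree methods with categorical support.

.. code-block:: python

    import pandas as pd
    import xgboost as xgb

    # Toy frame for illustration; the important part is the ``category`` dtype.
    X = pd.DataFrame(
        {
            "color": pd.Series(["red", "green", "blue", "red"], dtype="category"),
            "length": [1.0, 2.5, 3.0, 0.5],
        }
    )
    y = [0, 1, 1, 0]

    # No manual one-hot encoding: pass the frame as-is and opt in to categorical support.
    dtrain = xgb.DMatrix(X, label=y, enable_categorical=True)
    booster = xgb.train(
        {"tree_method": "approx", "objective": "binary:logistic"},
        dtrain,
        num_boost_round=10,
    )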

View File

@@ -211,8 +211,8 @@ latex_documents = [
intersphinx_mapping = {
"python": ("https://docs.python.org/3.8", None),
"numpy": ("https://docs.scipy.org/doc/numpy/", None),
"scipy": ("https://docs.scipy.org/doc/scipy/reference/", None),
"numpy": ("https://numpy.org/doc/stable/", None),
"scipy": ("https://docs.scipy.org/doc/scipy/", None),
"pandas": ("https://pandas.pydata.org/pandas-docs/stable/", None),
"sklearn": ("https://scikit-learn.org/stable", None),
"dask": ("https://docs.dask.org/en/stable/", None),

View File

@@ -19,15 +19,14 @@ I have a big dataset
********************
XGBoost is designed to be memory efficient. Usually it can handle problems as long as the data fit into your memory.
This usually means millions of instances.
If you are running out of memory, checkout :doc:`external memory version </tutorials/external_memory>` or
:doc:`distributed version </tutorials/aws_yarn>` of XGBoost.
**************************************************
Running XGBoost on platform X (Hadoop/Yarn, Mesos)
**************************************************
The distributed version of XGBoost is designed to be portable to various environment.
Distributed XGBoost can be ported to any platform that supports `rabit <https://github.com/dmlc/rabit>`_.
You can directly run XGBoost on Yarn. In theory Mesos and other resource allocation engines can be easily supported as well.
If you are running out of memory, check out the tutorial page on :doc:`distributed training </tutorials/index>` with one of the many supported frameworks, or the :doc:`external memory version </tutorials/external_memory>` of XGBoost.
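For the external-memory route, a minimal sketch of the iterator-based interface looks like the following; the batch count, random toy data, and ``load_batch`` helper are stand-ins for real on-disk chunks, and the exact signatures are described in the external memory tutorial.

.. code-block:: python

    import os
    import numpy as np
    import xgboost

    def load_batch(rng):
        """Stand-in for reading one chunk from disk; returns random data for illustration."""
        return rng.random((1024, 10)), rng.random(1024)

    class BatchIter(xgboost.DataIter):
        """Yield the data one batch at a time so it never has to fit in memory all at once."""

        def __init__(self, n_batches: int):
            self._n_batches = n_batches
            self._i = 0
            self._rng = np.random.default_rng(0)
            # ``cache_prefix`` is where XGBoost stages its external-memory cache on disk.
            super().__init__(cache_prefix=os.path.join(".", "cache"))

        def next(self, input_data) -> int:
            if self._i == self._n_batches:
                return 0  # all batches have been consumed
            X, y = load_batch(self._rng)
            input_data(data=X, label=y)
            self._i += 1
            return 1

        def reset(self) -> None:
            self._i = 0

    # Data is staged on disk rather than held fully in memory.
    dtrain = xgboost.DMatrix(BatchIter(n_batches=4))
    booster = xgboost.train({"tree_method": "hist"}, dtrain, num_boost_round=10)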
***********************************
How to handle categorical features?
***********************************
Visit :doc:`this tutorial </tutorials/categorical>` for a walkthrough of categorical data handling and some worked examples.
******************************************************************
Why not implement distributed XGBoost on top of X (Spark, Hadoop)?
@@ -50,7 +49,7 @@ which means the model trained by one language can be loaded in another.
This means you can train the model using R, while running prediction using
Java or C++, which are more common in production systems.
You can also train the model using distributed versions,
and load them in from Python to do some interactive analysis.
and load it in Python to do some interactive analysis. See :doc:`Model IO </tutorials/saving_model>` for more information.
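As a small sketch of that workflow (the file name is arbitrary): a model exported to JSON by any binding, say R or a distributed trainer, can be loaded back in Python for analysis.

.. code-block:: python

    import xgboost as xgb

    # Assume the model was produced elsewhere, e.g. in R or by a distributed trainer,
    # and exported with the equivalent of ``booster.save_model("model.json")``.
    booster = xgb.Booster()
    booster.load_model("model.json")  # JSON/UBJSON models are portable across bindings
    # ... interactive analysis, e.g. booster.predict(...) or plotting ...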
**************************
Do you support LambdaMART?
@@ -70,11 +69,10 @@ When the ``missing`` parameter is specified, values in the input predictor that i
**************************************
Slightly different result between runs
**************************************
This could happen, due to non-determinism in floating point summation order and multi-threading.
Though the general accuracy will usually remain the same.
This could happen due to non-determinism in floating point summation order and multi-threading. Changes in data partitioning by the distributed framework can also be an issue. Though the general accuracy will usually remain the same.
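When reproducibility matters more than speed, one mitigation (a sketch, not a guarantee, and it does not address repartitioning by a distributed framework) is to fix the random seed and restrict XGBoost to a single thread so the summation order stays stable:

.. code-block:: python

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    dtrain = xgb.DMatrix(rng.random((256, 8)), label=rng.random(256))

    params = {
        "seed": 0,        # fix the RNG used for subsampling, column sampling, etc.
        "nthread": 1,     # a single thread keeps the floating point summation order stable
        "tree_method": "hist",
    }
    booster = xgb.train(params, dtrain, num_boost_round=20)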
**********************************************************
Why do I see different results with sparse and dense data?
**********************************************************
"Sparse" elements are treated as if they were "missing" by the tree booster, and as zeros by the linear booster.
For tree models, it is important to use consistent data formats during training and scoring.
"Sparse" elements are treated as if they were "missing" by the tree booster, and as zeros by the linear booster. However, if we convert the sparse matrix back to dense matrix, the sparse matrix might fill the missing entries with 0, which is a valid value for xgboost.

View File

@@ -35,4 +35,5 @@ list of trees and can be sliced into multiple sub-models.
The sliced model is a copy of selected trees, which means the model itself is immutable
during slicing. This feature is the basis of the `save_best` option in the early stopping
callback.
callback. See :ref:`sphx_glr_python_examples_individual_trees.py` for a worked example on
how to combine prediction with sliced trees.
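A quick sketch of the slicing syntax on toy data (names and parameters are arbitrary): the slice is a copy of the selected trees, and the original booster stays untouched.

.. code-block:: python

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    dtrain = xgb.DMatrix(rng.random((128, 4)), label=rng.random(128))
    booster = xgb.train({"tree_method": "hist"}, dtrain, num_boost_round=10)

    first_half = booster[:5]            # a copy of trees [0, 5); ``booster`` is unchanged
    preds = first_half.predict(dtrain)  # predict with only the sliced sub-model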

View File

@@ -1,8 +0,0 @@
###############################
Distributed XGBoost YARN on AWS
###############################
[This page is under construction.]
.. note:: XGBoost with Spark
If you are preprocessing training data with Spark, consider using :doc:`XGBoost4J-Spark </jvm/xgboost4j_spark_tutorial>`.

View File

@@ -149,7 +149,7 @@ performance reasons.
References
**********
[1] Walter D. Fisher. "`On Grouping for Maximum Homogeneity`_." Journal of the American Statistical Association. Vol. 53, No. 284 (Dec., 1958), pp. 789-798.
[1] Walter D. Fisher. "`On Grouping for Maximum Homogeneity`_". Journal of the American Statistical Association. Vol. 53, No. 284 (Dec., 1958), pp. 789-798.
[2] Trevor Hastie, Robert Tibshirani, Jerome Friedman. "`The Elements of Statistical Learning`_". Springer Series in Statistics Springer New York Inc. (2001).

View File

@@ -3,7 +3,7 @@ XGBoost Tutorials
#################
This section contains official tutorials inside the XGBoost package.
See `Awesome XGBoost <https://github.com/dmlc/xgboost/tree/master/demo>`_ for more resources.
See `Awesome XGBoost <https://github.com/dmlc/xgboost/tree/master/demo>`_ for more resources. Also, don't miss the feature introductions in each package.
.. toctree::
:maxdepth: 1
@@ -11,7 +11,6 @@ See `Awesome XGBoost <https://github.com/dmlc/xgboost/tree/master/demo>`_ for mo
model
saving_model
Distributed XGBoost with AWS YARN <aws_yarn>
kubernetes
Distributed XGBoost with XGBoost4J-Spark <https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html>
Distributed XGBoost with XGBoost4J-Spark-GPU <https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_gpu_tutorial.html>