[doc] Document update [skip ci] (#8784)

- Remove version specifics in cat demo.
- Remove aws yarn.
- Update faq.
- Stop mentioning MPI.
- Update sphinx inventory links.
- Fix typo.
Jiaming Yuan 2023-02-12 04:25:22 +08:00 committed by GitHub
parent 8a16944664
commit e9c178f402
8 changed files with 19 additions and 31 deletions

View File

@@ -20,7 +20,7 @@
XGBoost is an optimized distributed gradient boosting library designed to be highly ***efficient***, ***flexible*** and ***portable***.
It implements machine learning algorithms under the [Gradient Boosting](https://en.wikipedia.org/wiki/Gradient_boosting) framework.
XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way.
The same code runs on major distributed environment (Kubernetes, Hadoop, SGE, MPI, Dask) and can solve problems beyond billions of examples.
The same code runs on major distributed environments (Kubernetes, Hadoop, SGE, Dask, Spark, PySpark) and can solve problems beyond billions of examples.
License
-------

View File

@@ -2,9 +2,7 @@
Getting started with categorical data
=====================================
Experimental support for categorical data. After 1.5 XGBoost `gpu_hist` tree method has
experimental support for one-hot encoding based tree split, and in 1.6 `approx` support
was added.
Experimental support for categorical data.
Previously, users needed to run an encoder themselves before passing the data into XGBoost,
which creates a sparse matrix and potentially increases memory usage. This demo
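For orientation, here is a rough sketch of what the demo works towards: a pandas ``category`` column is passed directly to ``DMatrix`` with ``enable_categorical=True`` instead of being one-hot encoded first. The toy data frame and column names are made up for illustration, and ``approx`` is used as one of the tree methods with categorical support.

.. code-block:: python

    import pandas as pd
    import xgboost as xgb

    # Toy frame for illustration; the important part is the ``category`` dtype.
    X = pd.DataFrame(
        {
            "color": pd.Series(["red", "green", "blue", "red"], dtype="category"),
            "length": [1.0, 2.5, 3.0, 0.5],
        }
    )
    y = [0, 1, 1, 0]

    # No manual one-hot encoding: pass the frame as-is and opt in to categorical support.
    dtrain = xgb.DMatrix(X, label=y, enable_categorical=True)
    booster = xgb.train(
        {"tree_method": "approx", "objective": "binary:logistic"},
        dtrain,
        num_boost_round=10,
    )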

View File

@@ -211,8 +211,8 @@ latex_documents = [
intersphinx_mapping = {
"python": ("https://docs.python.org/3.8", None),
"numpy": ("https://docs.scipy.org/doc/numpy/", None),
"scipy": ("https://docs.scipy.org/doc/scipy/reference/", None),
"numpy": ("https://numpy.org/doc/stable/", None),
"scipy": ("https://docs.scipy.org/doc/scipy/", None),
"pandas": ("https://pandas.pydata.org/pandas-docs/stable/", None),
"sklearn": ("https://scikit-learn.org/stable", None),
"dask": ("https://docs.dask.org/en/stable/", None),

View File

@@ -19,15 +19,14 @@ I have a big dataset
********************
XGBoost is designed to be memory efficient. Usually it can handle problems as long as the data fit into your memory.
This usually means millions of instances.
If you are running out of memory, checkout :doc:`external memory version </tutorials/external_memory>` or
:doc:`distributed version </tutorials/aws_yarn>` of XGBoost.
**************************************************
Running XGBoost on platform X (Hadoop/Yarn, Mesos)
**************************************************
The distributed version of XGBoost is designed to be portable to various environment.
Distributed XGBoost can be ported to any platform that supports `rabit <https://github.com/dmlc/rabit>`_.
You can directly run XGBoost on Yarn. In theory Mesos and other resource allocation engines can be easily supported as well.
If you are running out of memory, check out the tutorial page on :doc:`distributed training </tutorials/index>` with one of the many supported frameworks, or the :doc:`external memory version </tutorials/external_memory>` of XGBoost.
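For the external-memory route, a minimal sketch of the iterator-based interface looks like the following; the batch count, random toy data, and ``load_batch`` helper are stand-ins for real on-disk chunks, and the exact signatures are described in the external memory tutorial.

.. code-block:: python

    import os
    import numpy as np
    import xgboost

    def load_batch(rng):
        """Stand-in for reading one chunk from disk; returns random data for illustration."""
        return rng.random((1024, 10)), rng.random(1024)

    class BatchIter(xgboost.DataIter):
        """Yield the data one batch at a time so it never has to fit in memory all at once."""

        def __init__(self, n_batches: int):
            self._n_batches = n_batches
            self._i = 0
            self._rng = np.random.default_rng(0)
            # ``cache_prefix`` is where XGBoost stages its external-memory cache on disk.
            super().__init__(cache_prefix=os.path.join(".", "cache"))

        def next(self, input_data) -> int:
            if self._i == self._n_batches:
                return 0  # all batches have been consumed
            X, y = load_batch(self._rng)
            input_data(data=X, label=y)
            self._i += 1
            return 1

        def reset(self) -> None:
            self._i = 0

    # Data is staged on disk rather than held fully in memory.
    dtrain = xgboost.DMatrix(BatchIter(n_batches=4))
    booster = xgboost.train({"tree_method": "hist"}, dtrain, num_boost_round=10)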
***********************************
How to handle categorical features?
***********************************
Visit :doc:`this tutorial </tutorials/categorical>` for a walkthrough of categorical data handling and some worked examples.
******************************************************************
Why not implement distributed XGBoost on top of X (Spark, Hadoop)?
@@ -50,7 +49,7 @@ which means the model trained by one language can be loaded in another.
This means you can train the model using R, while running prediction using
Java or C++, which are more common in production systems.
You can also train the model using distributed versions,
and load them in from Python to do some interactive analysis.
and load it in Python to do some interactive analysis. See :doc:`Model IO </tutorials/saving_model>` for more information.
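As a small sketch of that workflow (the file name is arbitrary): a model exported to JSON by any binding, say R or a distributed trainer, can be loaded back in Python for analysis.

.. code-block:: python

    import xgboost as xgb

    # Assume the model was produced elsewhere, e.g. in R or by a distributed trainer,
    # and exported with the equivalent of ``booster.save_model("model.json")``.
    booster = xgb.Booster()
    booster.load_model("model.json")  # JSON/UBJSON models are portable across bindings
    # ... interactive analysis, e.g. booster.predict(...) or plotting ...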
**************************
Do you support LambdaMART?
@@ -70,11 +69,10 @@ When the ``missing`` parameter is specified, values in the input predictor that i
**************************************
Slightly different result between runs
**************************************
This could happen, due to non-determinism in floating point summation order and multi-threading.
Though the general accuracy will usually remain the same.
This could happen due to non-determinism in floating point summation order and multi-threading. Changes in data partitioning by the distributed framework can also be an issue. Though the general accuracy will usually remain the same.
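When reproducibility matters more than speed, one mitigation (a sketch, not a guarantee, and it does not address repartitioning by a distributed framework) is to fix the random seed and restrict XGBoost to a single thread so the summation order stays stable:

.. code-block:: python

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    dtrain = xgb.DMatrix(rng.random((256, 8)), label=rng.random(256))

    params = {
        "seed": 0,        # fix the RNG used for subsampling, column sampling, etc.
        "nthread": 1,     # a single thread keeps the floating point summation order stable
        "tree_method": "hist",
    }
    booster = xgb.train(params, dtrain, num_boost_round=20)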
**********************************************************
Why do I see different results with sparse and dense data?
**********************************************************
"Sparse" elements are treated as if they were "missing" by the tree booster, and as zeros by the linear booster.
For tree models, it is important to use consistent data formats during training and scoring.
"Sparse" elements are treated as if they were "missing" by the tree booster, and as zeros by the linear booster. However, if we convert the sparse matrix back to dense matrix, the sparse matrix might fill the missing entries with 0, which is a valid value for xgboost.

View File

@@ -35,4 +35,5 @@ list of trees and can be sliced into multiple sub-models.
The sliced model is a copy of selected trees, which means the model itself is immutable
during slicing. This feature is the basis of the `save_best` option in the early stopping
callback.
callback. See :ref:`sphx_glr_python_examples_individual_trees.py` for a worked example on
how to combine prediction with sliced trees.
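A quick sketch of the slicing syntax on toy data (names and parameters are arbitrary): the slice is a copy of the selected trees, and the original booster stays untouched.

.. code-block:: python

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    dtrain = xgb.DMatrix(rng.random((128, 4)), label=rng.random(128))
    booster = xgb.train({"tree_method": "hist"}, dtrain, num_boost_round=10)

    first_half = booster[:5]            # a copy of trees [0, 5); ``booster`` is unchanged
    preds = first_half.predict(dtrain)  # predict with only the sliced sub-model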

View File

@@ -1,8 +0,0 @@
###############################
Distributed XGBoost YARN on AWS
###############################
[This page is under construction.]
.. note:: XGBoost with Spark
If you are preprocessing training data with Spark, consider using :doc:`XGBoost4J-Spark </jvm/xgboost4j_spark_tutorial>`.

View File

@@ -149,7 +149,7 @@ performance reasons.
References
**********
[1] Walter D. Fisher. "`On Grouping for Maximum Homogeneity`_." Journal of the American Statistical Association. Vol. 53, No. 284 (Dec., 1958), pp. 789-798.
[1] Walter D. Fisher. "`On Grouping for Maximum Homogeneity`_". Journal of the American Statistical Association. Vol. 53, No. 284 (Dec., 1958), pp. 789-798.
[2] Trevor Hastie, Robert Tibshirani, Jerome Friedman. "`The Elements of Statistical Learning`_". Springer Series in Statistics Springer New York Inc. (2001).

View File

@@ -3,7 +3,7 @@ XGBoost Tutorials
#################
This section contains official tutorials inside the XGBoost package.
See `Awesome XGBoost <https://github.com/dmlc/xgboost/tree/master/demo>`_ for more resources.
See `Awesome XGBoost <https://github.com/dmlc/xgboost/tree/master/demo>`_ for more resources. Also, don't miss the feature introductions in each package.
.. toctree::
:maxdepth: 1
@@ -11,7 +11,6 @@ See `Awesome XGBoost <https://github.com/dmlc/xgboost/tree/master/demo>`_ for mo
model
saving_model
Distributed XGBoost with AWS YARN <aws_yarn>
kubernetes
Distributed XGBoost with XGBoost4J-Spark <https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html>
Distributed XGBoost with XGBoost4J-Spark-GPU <https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_gpu_tutorial.html>