From 18bd16341a45723e15584eccfc1c52b2041e098c Mon Sep 17 00:00:00 2001
From: Jiaming Yuan
Date: Tue, 21 Sep 2021 10:47:09 +0800
Subject: [PATCH] Update Python intro. [skip ci] (#7235)

* Fix the link to demo.
* Stop recommending text file inputs.
* Brief mention to scikit-learn interface.
* Fix indent warning in tree method doc.
---
 doc/faq.rst                 |  3 ++
 doc/python/python_intro.rst | 87 +++++++++++++++++++++++--------------
 doc/treemethod.rst          | 18 ++++----
 3 files changed, 67 insertions(+), 41 deletions(-)

diff --git a/doc/faq.rst b/doc/faq.rst
index dcaa4b1af..4ef5b9a8e 100644
--- a/doc/faq.rst
+++ b/doc/faq.rst
@@ -64,6 +64,9 @@ XGBoost supports missing values by default. In tree algorithms, branch
 directions for missing values are learned during training. Note that the gblinear
 booster treats missing values as zeros.
 
+When the ``missing`` parameter is specified, values in the input predictor that are equal
+to ``missing`` will be treated as missing and removed. By default it's set to ``NaN``.
+
 **************************************
 Slightly different result between runs
 **************************************
diff --git a/doc/python/python_intro.rst b/doc/python/python_intro.rst
index 0b0fccab8..e025d367c 100644
--- a/doc/python/python_intro.rst
+++ b/doc/python/python_intro.rst
@@ -5,7 +5,7 @@ This document gives a basic walkthrough of the xgboost package for Python.
 
 **List of other Helpful Links**
 
-* `Python walkthrough code collections `_
+* `Python walkthrough code collections `_
 * :doc:`Python API Reference `
 
 Install XGBoost
@@ -22,45 +22,23 @@ To verify your installation, run the following in Python:
 
 Data Interface
 --------------
-The XGBoost python module is able to load data from:
+The XGBoost Python module is able to load data from many different types of formats,
+including:
 
-- LIBSVM text format file
-- Comma-separated values (CSV) file
 - NumPy 2D array
 - SciPy 2D sparse array
+- Pandas data frame
 - cuDF DataFrame
-- Pandas data frame, and
+- cupy 2D array
+- dlpack
+- datatable
 - XGBoost binary buffer file.
+- LIBSVM text format file
+- Comma-separated values (CSV) file
 
 (See :doc:`/tutorials/input_format` for detailed description of text input format.)
 
 The data is stored in a :py:class:`DMatrix ` object.
 
-* To load a LIBSVM text file or a XGBoost binary file into :py:class:`DMatrix `:
-
-  .. code-block:: python
-
-    dtrain = xgb.DMatrix('train.svm.txt')
-    dtest = xgb.DMatrix('test.svm.buffer')
-
-* To load a CSV file into :py:class:`DMatrix `:
-
-  .. code-block:: python
-
-    # label_column specifies the index of the column containing the true label
-    dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0')
-    dtest = xgb.DMatrix('test.csv?format=csv&label_column=0')
-
-  .. note:: Categorical features not supported
-
-    Note that XGBoost does not provide specialization for categorical features; if your data contains
-    categorical features, load it as a NumPy array first and then perform corresponding preprocessing steps like
-    `one-hot encoding `_.
-
-  .. note:: Use Pandas to load CSV files with headers
-
-    Currently, the DMLC data parser cannot parse CSV files with headers. Use Pandas (see below) to read CSV files with headers.
-
 * To load a NumPy array into :py:class:`DMatrix `:
 
   .. code-block:: python
@@ -95,18 +73,41 @@ The data is stored in a :py:class:`DMatrix ` object.
 
   .. code-block:: python
 
-    dtrain = xgb.DMatrix(data, label=label, missing=-999.0)
+    dtrain = xgb.DMatrix(data, label=label, missing=np.NaN)
 
 * Weights can be set when needed:
 
   .. code-block:: python
 
     w = np.random.rand(5, 1)
-    dtrain = xgb.DMatrix(data, label=label, missing=-999.0, weight=w)
+    dtrain = xgb.DMatrix(data, label=label, missing=np.NaN, weight=w)
 
   When performing ranking tasks, the number of weights should be equal to number of groups.
 
+* To load a LIBSVM text file or a XGBoost binary file into :py:class:`DMatrix `:
+
+  .. code-block:: python
+
+    dtrain = xgb.DMatrix('train.svm.txt')
+    dtest = xgb.DMatrix('test.svm.buffer')
+
+  The parser in XGBoost has limited functionality. When using the Python interface, it's
+  recommended to use sklearn's ``load_svmlight_file`` or other similar utilities rather
+  than XGBoost's built-in parser.
+
+* To load a CSV file into :py:class:`DMatrix `:
+
+  .. code-block:: python
+
+    # label_column specifies the index of the column containing the true label
+    dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0')
+    dtest = xgb.DMatrix('test.csv?format=csv&label_column=0')
+
+  The parser in XGBoost has limited functionality. When using the Python interface, it's
+  recommended to use pandas ``read_csv`` or other similar utilities rather than
+  XGBoost's built-in parser.
+
 Setting Parameters
 ------------------
@@ -226,3 +227,25 @@ When you use ``IPython``, you can use the :py:meth:`xgboost.to_graphviz` function
 
 .. code-block:: python
 
     xgb.to_graphviz(bst, num_trees=2)
+
+
+Scikit-Learn interface
+----------------------
+
+XGBoost provides an easy-to-use scikit-learn interface for some pre-defined models,
+including regression, classification and ranking.
+
+.. code-block:: python
+
+    # Use "gpu_hist" for training the model.
+    reg = xgb.XGBRegressor(tree_method="gpu_hist")
+    # Fit the model using predictor X and response y.
+    reg.fit(X, y)
+    # Save the model into JSON format.
+    reg.save_model("regressor.json")
+
+Users can still access the underlying booster model when needed:
+
+.. code-block:: python
+
+    booster: xgb.Booster = reg.get_booster()
diff --git a/doc/treemethod.rst b/doc/treemethod.rst
index 50964fc9e..f47b8c027 100644
--- a/doc/treemethod.rst
+++ b/doc/treemethod.rst
@@ -85,16 +85,16 @@ Other Updaters
    ``min_split_loss (gamma)`` and ``max_depth``.
 
 2. ``Refresh``: Refresh the statistic of built trees on a new training dataset. Like the
-   pruner, To use refresh independently, one needs to set the process type to update:
-   ``{"process_type": "update", "updater": "refresh"}``. During training, the updater will
-   change statistics like ``cover`` and ``weight`` according to the new training dataset.
-   When ``refresh_leaf`` is also set to true (default), XGBoost will update the leaf value
-   according to the new leaf weight, but the tree structure (split condition) itself
-   doesn't change.
+   pruner, refresh can be used independently by setting the process type to update:
+   ``{"process_type": "update", "updater": "refresh"}``. During training, the updater
+   will change statistics like ``cover`` and ``weight`` according to the new training
+   dataset. When ``refresh_leaf`` is also set to true (default), XGBoost will update the
+   leaf value according to the new leaf weight, but the tree structure (split condition)
+   itself doesn't change.
 
-   There are examples on both training continuation (adding new trees) and using update
-   process on ``demo/guide-python``. Also checkout the ``process_type`` parameter in
-   :doc:`parameter`.
+   There are examples of both training continuation (adding new trees) and using the
+   update process in ``demo/guide-python``. Also check out the ``process_type``
+   parameter in :doc:`parameter`.
 
 3. ``Sync``: Synchronize the tree among workers when running distributed training.