Update Python intro. [skip ci] (#7235)

* Fix the link to demo.
* Stop recommending text file inputs.
* Brief mention to scikit-learn interface.
* Fix indent warning in tree method doc.
2021-09-21 10:47:09 +08:00 · 2021-09-21 10:47:09 +08:00 · 18bd16341a
commit 18bd16341a
parent 61a619b5c3
3 changed files with 67 additions and 41 deletions
--- a/doc/faq.rst
+++ b/doc/faq.rst
@ -64,6 +64,9 @@ XGBoost supports missing values by default.
 In tree algorithms, branch directions for missing values are learned during training.
 Note that the gblinear booster treats missing values as zeros.
 When the ``missing`` parameter is specified, values in the input predictor that are equal to
 ``missing`` will be treated as missing and removed.  By default it's set to ``NaN``.
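 As a minimal sketch of the two ways to mark missing entries, assume a hypothetical
 feature matrix in which ``-999.0`` stands for an absent measurement. The sentinel value
 and the array contents are made up for illustration; only the NumPy step is executed.

```python
import numpy as np

# Hypothetical feature matrix in which -999.0 marks an absent measurement.
SENTINEL = -999.0
data = np.array([[1.0, SENTINEL, 3.0],
                 [4.0, 5.0, SENTINEL]])

# Option 1: rewrite the sentinel as NaN so that the default ``missing`` value applies.
cleaned = np.where(data == SENTINEL, np.nan, data)
n_missing = int(np.isnan(cleaned).sum())

# Option 2 (not executed here): keep the array unchanged and pass the sentinel
# explicitly, for example xgb.DMatrix(data, missing=SENTINEL).
```

 Either route should lead to the same entries being treated as missing; the first keeps
 the convention visible in the data itself, the second keeps the raw array untouched.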
 **************************************
 Slightly different result between runs
 **************************************
--- a/doc/python/python_intro.rst
+++ b/doc/python/python_intro.rst
@ -5,7 +5,7 @@ This document gives a basic walkthrough of the xgboost package for Python.
 **List of other Helpful Links**
-* `Python walkthrough code collections <https://github.com/tqchen/xgboost/blob/master/demo/guide-python>`_
+* `Python walkthrough code collections <https://github.com/dmlc/xgboost/blob/master/demo/guide-python>`_
 * :doc:`Python API Reference <python_api>`
 Install XGBoost
@ -22,45 +22,23 @@ To verify your installation, run the following in Python:
 Data Interface
 --------------
-The XGBoost python module is able to load data from:
+The XGBoost Python module is able to load data from many different formats, including:
 - LIBSVM text format file
 - Comma-separated values (CSV) file
 - NumPy 2D array
 - SciPy 2D sparse array
 - Pandas data frame
 - cuDF DataFrame
- Pandas data frame, and
+- cupy 2D array
 - dlpack
 - datatable
 - XGBoost binary buffer file.
 - LIBSVM text format file
 - Comma-separated values (CSV) file
 (See :doc:`/tutorials/input_format` for detailed description of text input format.)
 The data is stored in a :py:class:`DMatrix <xgboost.DMatrix>` object.
 * To load a LIBSVM text file or a XGBoost binary file into :py:class:`DMatrix <xgboost.DMatrix>`:
  .. code-block:: python
    dtrain = xgb.DMatrix('train.svm.txt')
    dtest = xgb.DMatrix('test.svm.buffer')
 * To load a CSV file into :py:class:`DMatrix <xgboost.DMatrix>`:
  .. code-block:: python
    # label_column specifies the index of the column containing the true label
    dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0')
    dtest = xgb.DMatrix('test.csv?format=csv&label_column=0')
  .. note:: Categorical features not supported
    Note that XGBoost does not provide specialization for categorical features; if your data contains
    categorical features, load it as a NumPy array first and then perform corresponding preprocessing steps like
    `one-hot encoding <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>`_.
  .. note:: Use Pandas to load CSV files with headers
    Currently, the DMLC data parser cannot parse CSV files with headers. Use Pandas (see below) to read CSV files with headers.
 * To load a NumPy array into :py:class:`DMatrix <xgboost.DMatrix>`:
  .. code-block:: python
@ -95,18 +73,41 @@ The data is stored in a :py:class:`DMatrix <xgboost.DMatrix>` object.
  .. code-block:: python
-    dtrain = xgb.DMatrix(data, label=label, missing=-999.0)
+    dtrain = xgb.DMatrix(data, label=label, missing=np.nan)
 * Weights can be set when needed:
  .. code-block:: python
    w = np.random.rand(5, 1)
-    dtrain = xgb.DMatrix(data, label=label, missing=-999.0, weight=w)
+    dtrain = xgb.DMatrix(data, label=label, missing=np.nan, weight=w)
 When performing ranking tasks, the number of weights should be equal
 to the number of groups.
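 The ranking rule above can be sketched as follows. The group sizes, labels and weights
 are hypothetical, and ``build_ranking_dmatrix`` is an illustrative helper rather than
 part of XGBoost. It imports ``xgboost`` lazily so that the shape check runs on its own.

```python
import numpy as np

# Hypothetical ranking data: 5 rows split into two query groups of sizes 3 and 2.
group_sizes = [3, 2]
data = np.random.rand(sum(group_sizes), 4)
label = np.array([2, 1, 0, 1, 0])

# One weight per group, not one per row.
w = np.array([1.0, 0.5])
assert len(w) == len(group_sizes)


def build_ranking_dmatrix():
    """Illustrative helper: attach groups first, then the per-group weights."""
    import xgboost as xgb

    dtrain = xgb.DMatrix(data, label=label)
    dtrain.set_group(group_sizes)  # rows 0-2 form group 0, rows 3-4 form group 1
    dtrain.set_weight(w)
    return dtrain
```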
 * To load a LIBSVM text file or an XGBoost binary file into :py:class:`DMatrix <xgboost.DMatrix>`:
  .. code-block:: python
    dtrain = xgb.DMatrix('train.svm.txt')
    dtest = xgb.DMatrix('test.svm.buffer')
  The parser in XGBoost has limited functionality. When using the Python interface, it's
  recommended to use sklearn ``load_svmlight_file`` or similar utilities rather than
  XGBoost's builtin parser.
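  A minimal sketch of that recommendation, assuming scikit-learn is installed. The file
  contents are made up, and ``to_dmatrix`` is an illustrative helper that imports
  ``xgboost`` lazily so the parsing step can be checked without it.

```python
import os
import tempfile

from sklearn.datasets import load_svmlight_file

# Write a tiny, made-up LIBSVM file so the sketch is self-contained.
libsvm_text = "1 1:0.5 3:1.2\n0 2:0.7\n1 1:0.1 2:0.3 3:0.9\n"
with tempfile.NamedTemporaryFile("w", suffix=".svm.txt", delete=False) as f:
    f.write(libsvm_text)
    path = f.name

# X is a SciPy CSR matrix and y is a NumPy array holding the labels.
X, y = load_svmlight_file(path)
os.remove(path)


def to_dmatrix(features, labels):
    """Illustrative helper: hand the parsed data to XGBoost."""
    import xgboost as xgb

    return xgb.DMatrix(features, label=labels)
```

  Calling ``to_dmatrix(X, y)`` would then give a ``DMatrix`` built from the sparse matrix
  instead of from the text file.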
 * To load a CSV file into :py:class:`DMatrix <xgboost.DMatrix>`:
  .. code-block:: python
    # label_column specifies the index of the column containing the true label
    dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0')
    dtest = xgb.DMatrix('test.csv?format=csv&label_column=0')
  The parser in XGBoost has limited functionality. When using the Python interface, it's
  recommended to use pandas ``read_csv`` or similar utilities rather than XGBoost's builtin
  parser.
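  A minimal sketch of the pandas route, assuming a made-up CSV with a header row and the
  label stored in a column named ``label``. The column names are hypothetical, and
  ``to_dmatrix`` is an illustrative helper that imports ``xgboost`` lazily.

```python
import io

import pandas as pd

# Made-up CSV with a header row; the second data row has an empty cell in f1.
csv_text = "label,f0,f1\n1,0.5,1.2\n0,0.7,\n1,0.1,0.3\n"
df = pd.read_csv(io.StringIO(csv_text))

# Split the label column from the feature columns.
y = df["label"]
X = df.drop(columns=["label"])  # the empty cell is parsed as NaN


def to_dmatrix(features, labels):
    """Illustrative helper: hand the parsed data frame to XGBoost."""
    import xgboost as xgb

    return xgb.DMatrix(features, label=labels)
```

  Because pandas parses empty cells as ``NaN``, the default ``missing`` setting should pick
  them up without further configuration.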
 Setting Parameters
 ------------------
@ -226,3 +227,25 @@ When you use ``IPython``, you can use the :py:meth:`xgboost.to_graphviz` functio
 .. code-block:: python
  xgb.to_graphviz(bst, num_trees=2)
 Scikit-Learn interface
 ----------------------
 XGBoost provides an easy-to-use scikit-learn interface for some pre-defined models
 including regression, classification and ranking.
 .. code-block:: python
  # Use "gpu_hist" for training the model.
  reg = xgb.XGBRegressor(tree_method="gpu_hist")
  # Fit the model using predictor X and response y.
  reg.fit(X, y)
  # Save model into JSON format.
  reg.save_model("regressor.json")
 Users can still access the underlying booster model when needed:
 .. code-block:: python
   booster: xgb.Booster = reg.get_booster()
--- a/doc/treemethod.rst
+++ b/doc/treemethod.rst
@ -86,11 +86,11 @@ Other Updaters
 2. ``Refresh``: Refresh the statistics of built trees on a new training dataset.  Like the
   pruner, to use refresh independently, one needs to set the process type to update:
-  ``{"process_type": "update", "updater": "refresh"}``.  During training, the updater will
+   ``{"process_type": "update", "updater": "refresh"}``.  During training, the updater
-  change statistics like ``cover`` and ``weight`` according to the new training dataset.
+   will change statistics like ``cover`` and ``weight`` according to the new training
-  When ``refresh_leaf`` is also set to true (default), XGBoost will update the leaf value
+   dataset.  When ``refresh_leaf`` is also set to true (default), XGBoost will update the
-  according to the new leaf weight, but the tree structure (split condition) itself
+   leaf value according to the new leaf weight, but the tree structure (split condition)
-  doesn't change.
+   itself doesn't change.
   There are examples on both training continuation (adding new trees) and using update
   process on ``demo/guide-python``.  Also check out the ``process_type`` parameter in