Update Python intro. [skip ci] (#7235)

* Fix the link to demo.
* Stop recommending text file inputs.
* Brief mention to scikit-learn interface.
* Fix indent warning in tree method doc.
Jiaming Yuan 2021-09-21 10:47:09 +08:00 committed by GitHub
parent 61a619b5c3
commit 18bd16341a
3 changed files with 67 additions and 41 deletions


@@ -64,6 +64,9 @@ XGBoost supports missing values by default.
In tree algorithms, branch directions for missing values are learned during training.
Note that the gblinear booster treats missing values as zeros.
When the ``missing`` parameter is specified, values in the input predictor that are equal to
``missing`` will be treated as missing and removed. By default it's set to ``NaN``.
**************************************
Slightly different result between runs
**************************************


@@ -5,7 +5,7 @@ This document gives a basic walkthrough of the xgboost package for Python.
**List of other Helpful Links**
* `Python walkthrough code collections <https://github.com/dmlc/xgboost/blob/master/demo/guide-python>`_
* :doc:`Python API Reference <python_api>`
Install XGBoost
@@ -22,45 +22,23 @@ To verify your installation, run the following in Python:
Data Interface
--------------
The XGBoost python module is able to load data from many different data formats, including:
- NumPy 2D array
- SciPy 2D sparse array
- Pandas data frame
- cuDF DataFrame
- cupy 2D array
- dlpack
- datatable
- XGBoost binary buffer file
- LIBSVM text format file
- Comma-separated values (CSV) file
(See :doc:`/tutorials/input_format` for detailed description of text input format.)
The data is stored in a :py:class:`DMatrix <xgboost.DMatrix>` object.
* To load a NumPy array into :py:class:`DMatrix <xgboost.DMatrix>`:
.. code-block:: python
@@ -95,18 +73,41 @@ The data is stored in a :py:class:`DMatrix <xgboost.DMatrix>` object.
.. code-block:: python
dtrain = xgb.DMatrix(data, label=label, missing=np.NaN)
* Weights can be set when needed:
.. code-block:: python
w = np.random.rand(5, 1)
dtrain = xgb.DMatrix(data, label=label, missing=np.NaN, weight=w)
When performing ranking tasks, the number of weights should be equal
to the number of groups.
* To load a LIBSVM text file or an XGBoost binary file into :py:class:`DMatrix <xgboost.DMatrix>`:
.. code-block:: python
dtrain = xgb.DMatrix('train.svm.txt')
dtest = xgb.DMatrix('test.svm.buffer')
The parser in XGBoost has limited functionality. When using the Python interface, it's
recommended to use sklearn's ``load_svmlight_file`` or other similar utilities rather than
XGBoost's builtin parser.
* To load a CSV file into :py:class:`DMatrix <xgboost.DMatrix>`:
.. code-block:: python
# label_column specifies the index of the column containing the true label
dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0')
dtest = xgb.DMatrix('test.csv?format=csv&label_column=0')
The parser in XGBoost has limited functionality. When using the Python interface, it's
recommended to use pandas ``read_csv`` or other similar utilities rather than XGBoost's
builtin parser.
Setting Parameters
------------------
@@ -226,3 +227,25 @@ When you use ``IPython``, you can use the :py:meth:`xgboost.to_graphviz` functio
.. code-block:: python
xgb.to_graphviz(bst, num_trees=2)
Scikit-Learn interface
----------------------
XGBoost provides an easy-to-use scikit-learn interface for some pre-defined models,
including regression, classification and ranking.
.. code-block:: python
# Use "gpu_hist" for training the model.
reg = xgb.XGBRegressor(tree_method="gpu_hist")
# Fit the model using predictor X and response y.
reg.fit(X, y)
# Save model into JSON format.
reg.save_model("regressor.json")
Users can still access the underlying booster model when needed:
.. code-block:: python
booster: xgb.Booster = reg.get_booster()


@@ -85,16 +85,16 @@ Other Updaters
``min_split_loss (gamma)`` and ``max_depth``.
2. ``Refresh``: Refresh the statistics of built trees on a new training dataset. Like the
pruner, to use refresh independently, one needs to set the process type to update:
``{"process_type": "update", "updater": "refresh"}``. During training, the updater
will change statistics like ``cover`` and ``weight`` according to the new training
dataset. When ``refresh_leaf`` is also set to true (default), XGBoost will update the
leaf value according to the new leaf weight, but the tree structure (split condition)
itself doesn't change.
There are examples of both training continuation (adding new trees) and using the update
process in ``demo/guide-python``. Also check out the ``process_type`` parameter in
:doc:`parameter`.
3. ``Sync``: Synchronize the tree among workers when running distributed training.