From 18bd16341a45723e15584eccfc1c52b2041e098c Mon Sep 17 00:00:00 2001
From: Jiaming Yuan
Date: Tue, 21 Sep 2021 10:47:09 +0800
Subject: [PATCH] Update Python intro. [skip ci] (#7235)

* Fix the link to demo.
* Stop recommending text file inputs.
* Brief mention to scikit-learn interface.
* Fix indent warning in tree method doc.
---
 doc/faq.rst                 |  3 ++
 doc/python/python_intro.rst | 87 +++++++++++++++++++++++--------------
 doc/treemethod.rst          | 18 ++++----
 3 files changed, 67 insertions(+), 41 deletions(-)

diff --git a/doc/faq.rst b/doc/faq.rst
index dcaa4b1af..4ef5b9a8e 100644
--- a/doc/faq.rst
+++ b/doc/faq.rst
@@ -64,6 +64,9 @@ XGBoost supports missing values by default. In tree algorithms, branch
 directions for missing values are learned during training. Note that the gblinear
 booster treats missing values as zeros.
 
+When the ``missing`` parameter is specified, values in the input predictor that are equal
+to ``missing`` will be treated as missing and removed. By default it's set to ``NaN``.
+
 **************************************
 Slightly different result between runs
 **************************************
diff --git a/doc/python/python_intro.rst b/doc/python/python_intro.rst
index 0b0fccab8..e025d367c 100644
--- a/doc/python/python_intro.rst
+++ b/doc/python/python_intro.rst
@@ -5,7 +5,7 @@ This document gives a basic walkthrough of the xgboost package for Python.
 
 **List of other Helpful Links**
 
-* `Python walkthrough code collections `_
+* `Python walkthrough code collections `_
 * :doc:`Python API Reference `
 
 Install XGBoost
@@ -22,45 +22,23 @@ To verify your installation, run the following in Python:
 
 Data Interface
 --------------
-The XGBoost python module is able to load data from:
+The XGBoost Python module is able to load data from many different types of formats,
+including:
 
-- LIBSVM text format file
-- Comma-separated values (CSV) file
 - NumPy 2D array
 - SciPy 2D sparse array
+- Pandas data frame
 - cuDF DataFrame
-- Pandas data frame, and
+- cupy 2D array
+- dlpack
+- datatable
 - XGBoost binary buffer file.
+- LIBSVM text format file
+- Comma-separated values (CSV) file
 
 (See :doc:`/tutorials/input_format` for detailed description of text input format.)
 
 The data is stored in a :py:class:`DMatrix ` object.
 
-* To load a LIBSVM text file or a XGBoost binary file into :py:class:`DMatrix `:
-
-  .. code-block:: python
-
-    dtrain = xgb.DMatrix('train.svm.txt')
-    dtest = xgb.DMatrix('test.svm.buffer')
-
-* To load a CSV file into :py:class:`DMatrix `:
-
-  .. code-block:: python
-
-    # label_column specifies the index of the column containing the true label
-    dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0')
-    dtest = xgb.DMatrix('test.csv?format=csv&label_column=0')
-
-  .. note:: Categorical features not supported
-
-    Note that XGBoost does not provide specialization for categorical features; if your data contains
-    categorical features, load it as a NumPy array first and then perform corresponding preprocessing steps like
-    `one-hot encoding `_.
-
-  .. note:: Use Pandas to load CSV files with headers
-
-    Currently, the DMLC data parser cannot parse CSV files with headers. Use Pandas (see below) to read CSV files with headers.
-
 * To load a NumPy array into :py:class:`DMatrix `:
 
   .. code-block:: python
@@ -95,18 +73,41 @@ The data is stored in a :py:class:`DMatrix ` object.
 
   .. code-block:: python
 
-    dtrain = xgb.DMatrix(data, label=label, missing=-999.0)
+    dtrain = xgb.DMatrix(data, label=label, missing=np.NaN)
 
 * Weights can be set when needed:
 
   .. code-block:: python
 
     w = np.random.rand(5, 1)
-    dtrain = xgb.DMatrix(data, label=label, missing=-999.0, weight=w)
+    dtrain = xgb.DMatrix(data, label=label, missing=np.NaN, weight=w)
 
   When performing ranking tasks, the number of weights should be equal to number of groups.
 
+* To load a LIBSVM text file or a XGBoost binary file into :py:class:`DMatrix `:
+
+  .. code-block:: python
+
+    dtrain = xgb.DMatrix('train.svm.txt')
+    dtest = xgb.DMatrix('test.svm.buffer')
+
+  The parser in XGBoost has limited functionality. When using the Python interface, it's
+  recommended to use sklearn's ``load_svmlight_file`` or other similar utilities rather
+  than XGBoost's built-in parser.
+
+* To load a CSV file into :py:class:`DMatrix `:
+
+  .. code-block:: python
+
+    # label_column specifies the index of the column containing the true label
+    dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0')
+    dtest = xgb.DMatrix('test.csv?format=csv&label_column=0')
+
+  The parser in XGBoost has limited functionality. When using the Python interface, it's
+  recommended to use pandas ``read_csv`` or other similar utilities rather than
+  XGBoost's built-in parser.
+
 Setting Parameters
 ------------------
@@ -226,3 +227,25 @@ When you use ``IPython``, you can use the :py:meth:`xgboost.to_graphviz` function
 
 .. code-block:: python
 
     xgb.to_graphviz(bst, num_trees=2)
+
+
+Scikit-Learn interface
+----------------------
+
+XGBoost provides an easy-to-use scikit-learn interface for some pre-defined models,
+including regression, classification and ranking.
+
+.. code-block:: python
+
+    # Use "gpu_hist" for training the model.
+    reg = xgb.XGBRegressor(tree_method="gpu_hist")
+    # Fit the model using predictor X and response y.
+    reg.fit(X, y)
+    # Save the model into JSON format.
+    reg.save_model("regressor.json")
+
+Users can still access the underlying booster model when needed:
+
+.. code-block:: python
+
+    booster: xgb.Booster = reg.get_booster()
diff --git a/doc/treemethod.rst b/doc/treemethod.rst
index 50964fc9e..f47b8c027 100644
--- a/doc/treemethod.rst
+++ b/doc/treemethod.rst
@@ -85,16 +85,16 @@ Other Updaters
    ``min_split_loss (gamma)`` and ``max_depth``.
 
 2. ``Refresh``: Refresh the statistic of built trees on a new training dataset. Like the
-   pruner, To use refresh independently, one needs to set the process type to update:
-   ``{"process_type": "update", "updater": "refresh"}``. During training, the updater will
-   change statistics like ``cover`` and ``weight`` according to the new training dataset.
-   When ``refresh_leaf`` is also set to true (default), XGBoost will update the leaf value
-   according to the new leaf weight, but the tree structure (split condition) itself
-   doesn't change.
+   pruner, refresh can be used independently by setting the process type to update:
+   ``{"process_type": "update", "updater": "refresh"}``. During training, the updater
+   will change statistics like ``cover`` and ``weight`` according to the new training
+   dataset. When ``refresh_leaf`` is also set to true (default), XGBoost will update the
+   leaf value according to the new leaf weight, but the tree structure (split condition)
+   itself doesn't change.
 
-   There are examples on both training continuation (adding new trees) and using update
-   process on ``demo/guide-python``. Also checkout the ``process_type`` parameter in
-   :doc:`parameter`.
+   There are examples of both training continuation (adding new trees) and using the
+   update process in ``demo/guide-python``. Also check out the ``process_type``
+   parameter in :doc:`parameter`.
 
 3. ``Sync``: Synchronize the tree among workers when running distributed training.