Note for DaskDMatrix. (#5144)

* Brief introduction to `DaskDMatrix`.

* Add xgboost.dask.train to API doc
This commit is contained in:
Jiaming Yuan 2019-12-23 18:55:32 +08:00 committed by GitHub
parent c8bdb652c4
commit a4b929385e
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
4 changed files with 63 additions and 37 deletions

View File

@ -82,6 +82,8 @@ Dask API
.. autofunction:: xgboost.dask.DaskDMatrix .. autofunction:: xgboost.dask.DaskDMatrix
.. autofunction:: xgboost.dask.train
.. autofunction:: xgboost.dask.predict .. autofunction:: xgboost.dask.predict
.. autofunction:: xgboost.dask.DaskXGBClassifier .. autofunction:: xgboost.dask.DaskXGBClassifier

View File

@ -77,6 +77,27 @@ interface with ``DaskXGBClassifier`` and ``DaskXGBRegressor``. See ``xgboost/de
for more examples. for more examples.
*****************************************************************************
Why is the initialization of ``DaskDMatrix`` so slow and throws weird errors
*****************************************************************************
The dask API in XGBoost requires construction of ``DaskDMatrix``. With ``Scikit-Learn``
interface, ``DaskDMatrix`` is implicitly constructed for each input data during `fit` or
`predict`. You might have observed its construction is taking incredible amount of time,
and sometimes throws error that doesn't seem to be relevant to `DaskDMatrix`. Here is a
brief explanation for why. By default most of dask's computation is `lazy
<https://docs.dask.org/en/latest/user-interfaces.html#laziness-and-computing>`_, which
means the computation is not carried out until you explicitly ask for result, either by
calling `compute()` or `wait()`. See above link for details in dask, and `this wiki
<https://en.wikipedia.org/wiki/Lazy_evaluation>`_ for general concept of lazy evaluation.
The `DaskDMatrix` constructor forces all lazy computation to materialize, which means it's
where all your earlier computation actually being carried out, including operations like
`dd.read_csv()`. To isolate the computation in `DaskDMatrix` from other lazy
computations, one can explicitly wait for results of input data before calling constructor
of `DaskDMatrix`. Also dask's `web interface
<https://distributed.dask.org/en/latest/web.html>`_ can be used to monitor what operations
are currently being performed.
*********** ***********
Limitations Limitations
*********** ***********

View File

@ -113,25 +113,28 @@ def _assert_client(client):
class DaskDMatrix: class DaskDMatrix:
# pylint: disable=missing-docstring, too-many-instance-attributes # pylint: disable=missing-docstring, too-many-instance-attributes
'''DMatrix holding on references to Dask DataFrame or Dask Array. '''DMatrix holding on references to Dask DataFrame or Dask Array. Constructing
a `DaskDMatrix` forces all lazy computation to be carried out. Wait for
the input data explicitly if you want to see actual computation of
constructing `DaskDMatrix`.
Parameters Parameters
---------- ----------
client: dask.distributed.Client client: dask.distributed.Client
Specify the dask client used for training. Use default client Specify the dask client used for training. Use default client
returned from dask if it's set to None. returned from dask if it's set to None.
data : dask.array.Array/dask.dataframe.DataFrame data : dask.array.Array/dask.dataframe.DataFrame
data source of DMatrix. data source of DMatrix.
label: dask.array.Array/dask.dataframe.DataFrame label: dask.array.Array/dask.dataframe.DataFrame
label used for trainin. label used for trainin.
missing : float, optional missing : float, optional
Value in the input data (e.g. `numpy.ndarray`) which needs Value in the input data (e.g. `numpy.ndarray`) which needs
to be present as a missing value. If None, defaults to np.nan. to be present as a missing value. If None, defaults to np.nan.
weight : dask.array.Array/dask.dataframe.DataFrame weight : dask.array.Array/dask.dataframe.DataFrame
Weight for each instance. Weight for each instance.
feature_names : list, optional feature_names : list, optional
Set names for features. Set names for features.
feature_types : list, optional feature_types : list, optional
Set types for features Set types for features
''' '''
@ -349,23 +352,23 @@ def train(client, params, dtrain, *args, evals=(), **kwargs):
Parameters Parameters
---------- ----------
client: dask.distributed.Client client: dask.distributed.Client
Specify the dask client used for training. Use default client Specify the dask client used for training. Use default client
returned from dask if it's set to None. returned from dask if it's set to None.
\\*\\*kwargs:
Other parameters are the same as `xgboost.train` except for `evals_result`, Other parameters are the same as `xgboost.train` except for `evals_result`,
which is returned as part of function return value instead of argument. which is returned as part of function return value instead of argument.
Returns Returns
------- -------
results: dict results: dict
A dictionary containing trained booster and evaluation history. A dictionary containing trained booster and evaluation history.
`history` field is the same as `eval_result` from `xgboost.train`. `history` field is the same as `eval_result` from `xgboost.train`.
.. code-block:: python .. code-block:: python
{'booster': xgboost.Booster, {'booster': xgboost.Booster,
'history': {'train': {'logloss': ['0.48253', '0.35953']}, 'history': {'train': {'logloss': ['0.48253', '0.35953']},
'eval': {'logloss': ['0.480385', '0.357756']}}} 'eval': {'logloss': ['0.480385', '0.357756']}}}
''' '''
_assert_dask_support() _assert_dask_support()
@ -420,15 +423,15 @@ def train(client, params, dtrain, *args, evals=(), **kwargs):
def predict(client, model, data, *args): def predict(client, model, data, *args):
'''Run prediction with a trained booster. '''Run prediction with a trained booster.
.. note:: .. note::
Only default prediction mode is supported right now. Only default prediction mode is supported right now.
Parameters Parameters
---------- ----------
client: dask.distributed.Client client: dask.distributed.Client
Specify the dask client used for training. Use default client Specify the dask client used for training. Use default client
returned from dask if it's set to None. returned from dask if it's set to None.
model: A Booster or a dictionary returned by `xgboost.dask.train`. model: A Booster or a dictionary returned by `xgboost.dask.train`.
The trained model. The trained model.
data: DaskDMatrix data: DaskDMatrix

View File

@ -136,26 +136,26 @@ def to_graphviz(booster, fmap='', num_trees=0, rankdir=None,
Edge color when meets the node condition. Edge color when meets the node condition.
no_color : str, default '#FF0000' no_color : str, default '#FF0000'
Edge color when doesn't meet the node condition. Edge color when doesn't meet the node condition.
condition_node_params : dict (optional) condition_node_params : dict, optional
Condition node configuration for for graphviz. Example: Condition node configuration for for graphviz. Example:
.. code-block:: python .. code-block:: python
{'shape': 'box', {'shape': 'box',
'style': 'filled,rounded', 'style': 'filled,rounded',
'fillcolor': '#78bceb'} 'fillcolor': '#78bceb'}
leaf_node_params : dict (optional) leaf_node_params : dict, optional
Leaf node configuration for graphviz. Example: Leaf node configuration for graphviz. Example:
.. code-block:: python .. code-block:: python
{'shape': 'box', {'shape': 'box',
'style': 'filled', 'style': 'filled',
'fillcolor': '#e48038'} 'fillcolor': '#e48038'}
kwargs : Other keywords passed to graphviz graph_attr, E.g.: \\*\\*kwargs: dict, optional
``graph [ {key} = {value} ]`` Other keywords passed to graphviz graph_attr, e.g. ``graph [ {key} = {value} ]``
Returns Returns
------- -------