Note for DaskDMatrix. (#5144)

* Brief introduction to `DaskDMatrix`.

* Add xgboost.dask.train to API doc
Jiaming Yuan 2019-12-23 18:55:32 +08:00 committed by GitHub
parent c8bdb652c4
commit a4b929385e
4 changed files with 63 additions and 37 deletions


@ -82,6 +82,8 @@ Dask API
 .. autofunction:: xgboost.dask.DaskDMatrix
+.. autofunction:: xgboost.dask.train
 .. autofunction:: xgboost.dask.predict
 .. autofunction:: xgboost.dask.DaskXGBClassifier


@ -77,6 +77,27 @@ interface with ``DaskXGBClassifier`` and ``DaskXGBRegressor``. See ``xgboost/de
 for more examples.
+**********************************************************************
+Why ``DaskDMatrix`` initialization is slow and throws weird errors
+**********************************************************************
+The dask API in XGBoost requires constructing a ``DaskDMatrix``. With the ``Scikit-Learn``
+interface, a ``DaskDMatrix`` is implicitly constructed for each input during ``fit`` or
+``predict``. You might have observed that its construction takes a very long time
+and sometimes throws errors that don't seem related to ``DaskDMatrix``. Here is a
+brief explanation. By default, most of dask's computation is `lazy
+<https://docs.dask.org/en/latest/user-interfaces.html#laziness-and-computing>`_, which
+means the computation is not carried out until you explicitly ask for the result, either by
+calling ``compute()`` or ``wait()``. See the link above for details on dask, and `this wiki
+<https://en.wikipedia.org/wiki/Lazy_evaluation>`_ for the general concept of lazy evaluation.
+The ``DaskDMatrix`` constructor forces all lazy computation to materialize, so it is
+where all your earlier computation is actually carried out, including operations like
+``dd.read_csv()``. To isolate the work done by ``DaskDMatrix`` from other lazy
+computations, explicitly wait for the results of the input data before calling the
+``DaskDMatrix`` constructor. Dask's `web interface
+<https://distributed.dask.org/en/latest/web.html>`_ can also be used to monitor which
+operations are currently being performed.
 ***********
 Limitations
 ***********
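
A toy illustration of the laziness described in the ``DaskDMatrix`` note in the hunk above, written in plain Python rather than dask. ``Lazy``, ``consume``, ``load_rows``, ``parse_rows`` and ``broken_load`` are hypothetical names invented for this sketch, and ``consume`` merely stands in for a constructor that forces its input. It shows deferred work and deferred errors surfacing at the consumer, and how materializing the input first separates the two costs.

```python
# Toy model of lazy evaluation in plain Python. This is NOT dask's API:
# `Lazy` and `consume` are hypothetical stand-ins for a lazy collection
# and for a consumer such as the `DaskDMatrix` constructor.


class Lazy:
    """Records a function call and runs it only when materialized."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args
        self._done = False
        self._value = None

    def materialize(self):
        # Run the deferred call once, resolving any lazy arguments first.
        if not self._done:
            args = [a.materialize() if isinstance(a, Lazy) else a
                    for a in self.args]
            self._value = self.func(*args)
            self._done = True
        return self._value


def consume(data):
    """Hypothetical consumer that forces its input, as a constructor might."""
    return len(data.materialize())


log = []  # records which upstream steps actually ran, in order


def load_rows(n):
    log.append("load")  # stands in for an expensive read
    return list(range(n))


def parse_rows(rows):
    log.append("parse")
    return [str(r) for r in rows]


def broken_load():
    # An error defined far away from the consumer.
    raise FileNotFoundError("no such input")


# Building the pipeline does no work yet.
pipeline = Lazy(parse_rows, Lazy(load_rows, 5))
work_before_consume = list(log)

# The consumer triggers everything, so it looks like the slow step.
n_rows = consume(pipeline)
work_after_consume = list(log)

# Isolating the cost: materialize the input first, then hand it over.
log.clear()
ready = Lazy(parse_rows, Lazy(load_rows, 3))
ready.materialize()  # the upstream cost is paid here
work_before_second = list(log)
consume(ready)
extra_work = log[len(work_before_second):]  # nothing new ran in consume()

# An upstream failure also surfaces at the consumer.
try:
    consume(Lazy(parse_rows, Lazy(broken_load)))
    error_seen = None
except FileNotFoundError as exc:
    error_seen = type(exc).__name__
```

With real dask collections, the analogous step would be to persist the input and call ``dask.distributed.wait`` on it before constructing ``DaskDMatrix``. Treat that as a pointer to check against the dask documentation, not as a tested recipe.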


@ -113,7 +113,10 @@ def _assert_client(client):
 class DaskDMatrix:
     # pylint: disable=missing-docstring, too-many-instance-attributes
-    '''DMatrix holding on references to Dask DataFrame or Dask Array.
+    '''DMatrix holding on references to Dask DataFrame or Dask Array. Constructing
+    a `DaskDMatrix` forces all lazy computation to be carried out. Wait for
+    the input data explicitly if you want to see actual computation of
+    constructing `DaskDMatrix`.
     Parameters
     ----------
@ -351,7 +354,7 @@ def train(client, params, dtrain, *args, evals=(), **kwargs):
     client: dask.distributed.Client
         Specify the dask client used for training. Use default client
         returned from dask if it's set to None.
+    \\*\\*kwargs:
         Other parameters are the same as `xgboost.train` except for `evals_result`,
         which is returned as part of function return value instead of argument.


@ -136,7 +136,7 @@ def to_graphviz(booster, fmap='', num_trees=0, rankdir=None,
         Edge color when meets the node condition.
     no_color : str, default '#FF0000'
         Edge color when doesn't meet the node condition.
-    condition_node_params : dict (optional)
+    condition_node_params : dict, optional
         Condition node configuration for graphviz. Example:
         .. code-block:: python
@ -145,7 +145,7 @@ def to_graphviz(booster, fmap='', num_trees=0, rankdir=None,
              'style': 'filled,rounded',
              'fillcolor': '#78bceb'}
-    leaf_node_params : dict (optional)
+    leaf_node_params : dict, optional
         Leaf node configuration for graphviz. Example:
         .. code-block:: python
@ -154,8 +154,8 @@ def to_graphviz(booster, fmap='', num_trees=0, rankdir=None,
              'style': 'filled',
              'fillcolor': '#e48038'}
-    kwargs : Other keywords passed to graphviz graph_attr, E.g.:
-        ``graph [ {key} = {value} ]``
+    \\*\\*kwargs: dict, optional
+        Other keywords passed to graphviz graph_attr, e.g. ``graph [ {key} = {value} ]``
     Returns
     -------