Note for DaskDMatrix. (#5144)

* Brief introduction to `DaskDMatrix`.

* Add xgboost.dask.train to API doc
Jiaming Yuan 2019-12-23 18:55:32 +08:00 committed by GitHub
parent c8bdb652c4
commit a4b929385e
4 changed files with 63 additions and 37 deletions


@ -82,6 +82,8 @@ Dask API
.. autofunction:: xgboost.dask.DaskDMatrix
.. autofunction:: xgboost.dask.train
.. autofunction:: xgboost.dask.predict
.. autofunction:: xgboost.dask.DaskXGBClassifier


@ -77,6 +77,27 @@ interface with ``DaskXGBClassifier`` and ``DaskXGBRegressor``. See ``xgboost/de
for more examples.
*****************************************************************************
Why is the initialization of ``DaskDMatrix`` so slow and throws weird errors
*****************************************************************************
The dask API in XGBoost requires construction of a ``DaskDMatrix``. With the ``Scikit-Learn``
interface, a ``DaskDMatrix`` is implicitly constructed for each input during ``fit`` or
``predict``. You might have observed that its construction takes a very long time and
sometimes throws errors that don't seem to be related to ``DaskDMatrix``. Here is a
brief explanation of why. By default most of dask's computation is `lazy
<https://docs.dask.org/en/latest/user-interfaces.html#laziness-and-computing>`_, which
means the computation is not carried out until you explicitly ask for the result, either by
calling ``compute()`` or ``wait()``. See the link above for details on dask, and `this wiki
<https://en.wikipedia.org/wiki/Lazy_evaluation>`_ for the general concept of lazy evaluation.
The ``DaskDMatrix`` constructor forces all lazy computation to materialize, which means it is
where all your earlier computation is actually carried out, including operations like
``dd.read_csv()``. Any error raised by those earlier steps therefore surfaces there as well.
To isolate the computation in ``DaskDMatrix`` from other lazy computations, you can
explicitly wait for the results of the input data before calling the ``DaskDMatrix``
constructor. Dask's `web interface
<https://distributed.dask.org/en/latest/web.html>`_ can also be used to monitor which
operations are currently being performed.
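As a minimal sketch of the advice above, the inputs can be persisted and waited on before the ``DaskDMatrix`` is built, so that slow or failing upstream steps show up on their own. The helper names ``load_inputs`` and ``materialize``, the random data, and the local two-worker cluster are assumptions made for illustration; they are not part of XGBoost or of this commit.

```python
import dask.array as da
from dask.distributed import Client, LocalCluster, wait


def load_inputs(n_rows=10_000, n_cols=10, chunk_rows=1_000):
    """Build lazy dask arrays (hypothetical stand-in for real input data).

    Nothing is computed here; dask only records a task graph.
    """
    X = da.random.random((n_rows, n_cols), chunks=(chunk_rows, n_cols))
    y = da.random.random(n_rows, chunks=chunk_rows)
    return X, y


def materialize(*collections):
    """Start computing the collections on the cluster and block until done.

    Requires an active dask.distributed Client. Errors from upstream lazy
    steps are expected to surface while persisting or waiting, not later.
    """
    persisted = [c.persist() for c in collections]
    wait(persisted)
    return persisted


if __name__ == "__main__":
    # Imported here so the helpers above can be used without xgboost installed.
    from xgboost import dask as dxgb

    with LocalCluster(n_workers=2, threads_per_worker=1) as cluster:
        with Client(cluster) as client:
            X, y = load_inputs()
            # Upstream work happens here, separately from DaskDMatrix.
            X, y = materialize(X, y)
            # Inputs are already in cluster memory at this point.
            dtrain = dxgb.DaskDMatrix(client, X, y)
```

With the inputs materialized first, time spent or errors raised inside the ``DaskDMatrix`` constructor are easier to attribute to the constructor itself rather than to earlier steps such as reading files.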
***********
Limitations
***********


@ -113,7 +113,10 @@ def _assert_client(client):
class DaskDMatrix:
# pylint: disable=missing-docstring, too-many-instance-attributes
'''DMatrix holding on references to Dask DataFrame or Dask Array.
'''DMatrix holding references to a Dask DataFrame or Dask Array. Constructing
a `DaskDMatrix` forces all lazy computation to be carried out. Wait for
the input data explicitly if you want to observe the computation done by
the `DaskDMatrix` constructor itself.
Parameters
----------
@ -351,7 +354,7 @@ def train(client, params, dtrain, *args, evals=(), **kwargs):
client: dask.distributed.Client
Specify the dask client used for training. Use default client
returned from dask if it's set to None.
\\*\\*kwargs:
Other parameters are the same as `xgboost.train` except for `evals_result`,
which is returned as part of function return value instead of argument.


@ -136,7 +136,7 @@ def to_graphviz(booster, fmap='', num_trees=0, rankdir=None,
Edge color when meets the node condition.
no_color : str, default '#FF0000'
Edge color when doesn't meet the node condition.
condition_node_params : dict (optional)
condition_node_params : dict, optional
Condition node configuration for graphviz. Example:
.. code-block:: python
@ -145,7 +145,7 @@ def to_graphviz(booster, fmap='', num_trees=0, rankdir=None,
'style': 'filled,rounded',
'fillcolor': '#78bceb'}
leaf_node_params : dict (optional)
leaf_node_params : dict, optional
Leaf node configuration for graphviz. Example:
.. code-block:: python
@ -154,8 +154,8 @@ def to_graphviz(booster, fmap='', num_trees=0, rankdir=None,
'style': 'filled',
'fillcolor': '#e48038'}
kwargs : Other keywords passed to graphviz graph_attr, E.g.:
``graph [ {key} = {value} ]``
\\*\\*kwargs: dict, optional
Other keywords passed to graphviz graph_attr, e.g. ``graph [ {key} = {value} ]``
Returns
-------