[doc][dask] Note on reproducible result. [skip ci] (#8903)
This commit is contained in:
parent 3689695d16
commit bbee355b45
@@ -564,6 +564,70 @@ computations, one can explicitly wait for results of input data before construct
Also dask's `diagnostics dashboard <https://distributed.dask.org/en/latest/web.html>`_ can be used to
monitor what operations are currently being performed.

*******************
Reproducible Result
*******************

In single-node mode, we can always expect the same training result between runs as long
as the underlying platforms are the same. However, it's difficult to obtain reproducible
results in a distributed environment, since the tasks might get different machine
allocations or have different amounts of available resources during different
sessions. There are heuristics and guidelines on how to achieve reproducibility, but no
proven method for guaranteeing such deterministic behavior. The Dask interface in XGBoost
tries to provide reproducible results on a best-effort basis. This section highlights
some known criteria and tries to share some insight into the issue.

There are primarily two different tasks that XGBoost carries out: training and
inference. Inference is reproducible given the same software and hardware, along with the
same run-time configuration such as the number of threads. The remainder of this section
will focus on training.

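For example, when experimenting on a local cluster, one way to keep the run-time
configuration identical between runs is to pin the cluster layout explicitly. This is
only a minimal sketch; the worker and thread counts below are arbitrary placeholders and
do not by themselves guarantee reproducibility:

.. code-block:: python

    from dask.distributed import Client, LocalCluster

    # Fix the number of workers and the number of threads per worker so that every
    # run sees the same run-time configuration.  The values are placeholders.
    with LocalCluster(n_workers=2, threads_per_worker=4) as cluster:
        with Client(cluster) as client:
            ...  # run XGBoost training or inference with this client
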
Many of the challenges come from the fact that we are using approximation algorithms: the
sketching algorithm used to find histogram bins is an approximation to the exact quantile
algorithm, the `AUC` metric in a distributed environment is an approximation to the exact
`AUC` score, and floating-point numbers are an approximation to real numbers. Floating
point is an issue as its summation is not associative, meaning :math:`(a + b) + c` does
not necessarily equal :math:`a + (b + c)`, even though this property holds true for real
numbers. As a result, whenever we change the order of summation, the result can
differ. This imposes the requirement that, in order to have reproducible output from
XGBoost, the entire pipeline needs to be reproducible.

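For instance, summing the same three floating-point numbers in two different orders can
already give different results. This is a standalone illustration, not specific to
XGBoost:

.. code-block:: python

    a, b, c = 1e16, -1e16, 1.0

    # With real numbers both expressions equal 1.0, but floating-point rounding
    # makes the two orders of summation disagree.
    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0
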
- The software stack is the same for each run. This goes without saying. XGBoost might
  generate different outputs between different versions. This is expected, as we might
  change the default value of a hyper-parameter, or change the parallel strategy in a way
  that generates different floating-point results. We guarantee the correctness of the
  algorithms, but there is lots of wiggle room for the final output. The situation is
  similar for many dependencies; for instance, the random number generator might differ
  from platform to platform.

- The hardware stack is the same for each run. This includes the number of workers and
  the amount of available resources on each worker. XGBoost can generate different
  results with different numbers of workers. This is caused by the approximation issue
  mentioned previously.

- Similar to the hardware constraint, the network topology is also a factor in the final
  output. If the topology changes, the workers might be ordered differently, leading to a
  different ordering of floating-point operations.

- The random seed used in various places of the pipeline needs to be the same for each
  run (see the sketch at the end of this section).

- The partitioning of data needs to be reproducible. This is related to the available
  resources on each worker. Dask might partition the data differently for each run
  according to its own scheduling policy. For instance, if there are some additional
  tasks in the cluster while you are running a second training session for XGBoost, some
  of the workers might have constrained memory and Dask may not push the training data
  for XGBoost to those workers. This change in data partitioning can lead to different
  output models. If you are using a shared Dask cluster, then the result is likely to
  vary between runs.

- The operations performed on dataframes need to be reproducible. Some operations, such
  as `DataFrame.merge`, are not deterministic on parallel hardware like GPUs, where the
  order of the index in the merge result might differ from run to run.

Due to the aforementioned criteria, it is expected that training the model in a
distributed environment will produce different results than training it on a single node.

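The following sketch puts the controllable pieces together: it fixes the random seed
passed to XGBoost and explicitly persists and waits for the input data so that the
partitioning is settled before training starts. This is a best-effort illustration; the
data, parameter values, and helper function name are placeholders, and it does not
guarantee bit-for-bit identical models across runs:

.. code-block:: python

    import dask.array as da
    import xgboost as xgb
    from dask.distributed import Client, wait


    def train_with_fixed_seed(client: Client):
        # Placeholder data for illustration; replace with your own dataset.
        X = da.random.random((100000, 20), chunks=(10000, 20))
        y = da.random.random(100000, chunks=(10000,))

        # Persist the data and wait for it, so that the partitioning is decided
        # before training begins.
        X, y = client.persist([X, y])
        wait([X, y])

        dtrain = xgb.dask.DaskDMatrix(client, X, y)
        # Fix the random seed used by XGBoost itself.
        output = xgb.dask.train(
            client,
            {"tree_method": "hist", "seed": 0},
            dtrain,
            num_boost_round=10,
        )
        return output["booster"]
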
************
Memory Usage
************