guide stuff

nachocano 2015-01-11 01:43:32 -08:00
parent 2d97833f48
commit d269cb9c50

@@ -251,8 +251,7 @@ This fault tolerance model is based on a key property of Allreduce and Broadcast
 All the nodes get the same result when calling Allreduce/Broadcast. Because of this property, any node can record the history,
 and when a node recovers, the result can be forwarded to it.
-The checkpoint is introduced so that we can discard the history after checkpointing, so that the iterative program can be more
-efficient. The strategy of rabit is different from the fail-restart strategy where all the nodes restart from the same checkpoint
+The checkpoint is introduced so that we can discard the history after checkpointing, this makes the iterative program more efficient. The strategy of rabit is different from the fail-restart strategy where all the nodes restart from the same checkpoint
 when any of them fail. All the processes will block in the Allreduce call to help the recovery, and the checkpoint is only saved locally without
 touching the disk. This makes rabit programs more reliable and efficient.
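
The recovery model described in this hunk can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not rabit's implementation: `ToyNode`, `allreduce_sum`, `save_checkpoint`, `recover`, and `run_demo` are hypothetical names, and the "network" is a plain Python list. It shows the two ideas in the text: every node records the same Allreduce result, so a surviving peer can forward it to a recovering node, and a checkpoint lets every node discard the history recorded before it.

```python
class ToyNode:
    """Toy worker; a hypothetical stand-in, not rabit's implementation."""

    def __init__(self, rank):
        self.rank = rank
        self.model = 0            # replicated model state
        self.version = 0          # number of checkpoints taken so far
        self.checkpoint = (0, 0)  # (version, model), held in memory only
        self.history = []         # Allreduce results since the last checkpoint


def allreduce_sum(nodes, values):
    """Sum one value per node; every node records the same result."""
    total = sum(values)
    for node in nodes:
        node.history.append(total)
    return total


def save_checkpoint(nodes):
    """Save the model locally and discard the history that preceded it."""
    for node in nodes:
        node.version += 1
        node.checkpoint = (node.version, node.model)
        node.history.clear()


def recover(rank, peer):
    """Rebuild a lost node from a surviving peer's checkpoint and history."""
    fresh = ToyNode(rank)
    fresh.version, fresh.model = peer.checkpoint
    fresh.checkpoint = peer.checkpoint
    fresh.history = list(peer.history)  # results forwarded by the peer
    return fresh


def run_demo():
    nodes = [ToyNode(rank) for rank in range(3)]

    # Iteration 1: one Allreduce, apply the result, then checkpoint.
    total = allreduce_sum(nodes, [node.rank + 1 for node in nodes])
    for node in nodes:
        node.model += total
    save_checkpoint(nodes)

    # Iteration 2: the Allreduce completes, then node 1 is lost
    # before it can apply the result. The survivors carry on.
    total = allreduce_sum(nodes, [node.rank + 1 for node in nodes])
    for node in (nodes[0], nodes[2]):
        node.model += total

    # Node 1 restarts from the peer's checkpoint and replays the forwarded
    # results; the survivors do not roll back or recompute anything.
    nodes[1] = recover(1, peer=nodes[0])
    for result in nodes[1].history:
        nodes[1].model += result
    return nodes
```

Calling `run_demo()` leaves all three toy nodes with the same model value, and only the restarted node had to replay anything. In a fail-restart design, by contrast, every node would reload the checkpoint and redo iteration 2.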