diff --git a/guide/README.md b/guide/README.md
index 93f2f210f..a2ef94160 100644
--- a/guide/README.md
+++ b/guide/README.md
@@ -251,8 +251,7 @@ This fault tolerance model is based on a key property of Allreduce and Broadcast
 All the nodes get the same result when calling Allreduce/Broadcast. Because of this property, any node can record the history,
 and when a node recovers, the result can be forwarded to it.
 
-The checkpoint is introduced so that we can discard the history after checkpointing, so that the iterative program can be more
-efficient. The strategy of rabit is different from the fail-restart strategy where all the nodes restart from the same checkpoint
+The checkpoint is introduced so that we can discard the history after checkpointing, this makes the iterative program more efficient. The strategy of rabit is different from the fail-restart strategy where all the nodes restart from the same checkpoint
 when any of them fail. All the processes will block in the Allreduce call to help the recovery, and the checkpoint is only saved locally without
 touching the disk. This makes rabit programs more reliable and efficient.
 