cosmetic changes to tutorial
This commit is contained in:
parent
7eb4258951
commit
aea4c10847
@ -2,7 +2,7 @@
|
|||||||
|
|
||||||
rabit is a light weight library that provides a fault tolerant interface of Allreduce and Broadcast. It is designed to support easy implementations of distributed machine learning programs, many of which fall naturally under the Allreduce abstraction.
|
rabit is a light weight library that provides a fault tolerant interface of Allreduce and Broadcast. It is designed to support easy implementations of distributed machine learning programs, many of which fall naturally under the Allreduce abstraction.
|
||||||
|
|
||||||
* [Tutorial of Rabit](guide)
|
* [Tutorial](guide)
|
||||||
* [API Documentation](http://homes.cs.washington.edu/~tqchen/rabit/doc)
|
* [API Documentation](http://homes.cs.washington.edu/~tqchen/rabit/doc)
|
||||||
* You can also directly read the [interface header](include/rabit.h)
|
* You can also directly read the [interface header](include/rabit.h)
|
||||||
|
|
||||||
|
|||||||
121
guide/README.md
121
guide/README.md
@ -1,15 +1,15 @@
|
|||||||
Tutorial of Rabit
|
Tutorial
|
||||||
=====
|
=====
|
||||||
This is an tutorial of rabit, a ***Reliable Allreduce and Broadcast interface***.
|
This is rabit's tutorial, a ***Reliable Allreduce and Broadcast Interface***.
|
||||||
To run the examples locally, you will need to type ```make``` to build all the examples.
|
To run the examples locally, you will need to build them with ```make```.
|
||||||
|
|
||||||
Please also refer to the [API Documentation](http://homes.cs.washington.edu/~tqchen/rabit/doc)
|
Please also refer to the [API Documentation](http://homes.cs.washington.edu/~tqchen/rabit/doc) for further details.
|
||||||
|
|
||||||
|
|
||||||
**List of Topics**
|
**List of Topics**
|
||||||
* [What is Allreduce](#what-is-allreduce)
|
* [What is Allreduce](#what-is-allreduce)
|
||||||
* [Common Use Case](#common-use-case)
|
* [Common Use Case](#common-use-case)
|
||||||
* [Structure of Rabit Program](#structure-of-rabit-program)
|
* [Structure of a Rabit Program](#structure-of-rabit-program)
|
||||||
* [Compile Programs with Rabit](#compile-programs-with-rabit)
|
* [Compile Programs with Rabit](#compile-programs-with-rabit)
|
||||||
* [Running Rabit Jobs](#running-rabit-jobs)
|
* [Running Rabit Jobs](#running-rabit-jobs)
|
||||||
- [Running Rabit on Hadoop](#running-rabit-on-hadoop)
|
- [Running Rabit on Hadoop](#running-rabit-on-hadoop)
|
||||||
@ -19,8 +19,8 @@ Please also refer to the [API Documentation](http://homes.cs.washington.edu/~tqc
|
|||||||
|
|
||||||
What is Allreduce
|
What is Allreduce
|
||||||
=====
|
=====
|
||||||
The main method provided by rabit are Allreduce and Broadcast. Allreduce performs reduction across different computation nodes,
|
The main methods provided by rabit are Allreduce and Broadcast. Allreduce performs reduction across different computation nodes,
|
||||||
and returning the results to all the nodes. To understand the behavior of the function. Consider the following example in [basic.cc](basic.cc).
|
and returns the result to every node. To understand the behavior of the function, consider the following example in [basic.cc](basic.cc).
|
||||||
```c++
|
```c++
|
||||||
#include <rabit.h>
|
#include <rabit.h>
|
||||||
using namespace rabit;
|
using namespace rabit;
|
||||||
@ -41,24 +41,24 @@ int main(int argc, char *argv[]) {
|
|||||||
return 0;
|
return 0;
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
You can run the example using the rabit_demo.py script. The following commmand
|
You can run the example using the rabit_demo.py script. The following command
|
||||||
start rabit program with two worker processes.
|
starts the rabit program with two worker processes.
|
||||||
```bash
|
```bash
|
||||||
../tracker/rabit_demo.py -n 2 basic.rabit
|
../tracker/rabit_demo.py -n 2 basic.rabit
|
||||||
```
|
```
|
||||||
This will start two process, one process with rank 0 and another rank 1, running the same code.
|
This will start two processes, one process with rank 0 and the other with rank 1, both processes run the same code.
|
||||||
The ```rabit::GetRank()``` function return the rank of current process.
|
The ```rabit::GetRank()``` function returns the rank of current process.
|
||||||
|
|
||||||
Before the call the allreduce, process 0 contains array ```a = {0, 1, 2}```, while process 1 have array
|
Before the call to Allreduce, process 0 contains the array ```a = {0, 1, 2}```, while process 1 has the array
|
||||||
```a = {1, 2, 3}```. After the call of Allreduce, the array contents in all processes are replaced by the
|
```a = {1, 2, 3}```. After the call to Allreduce, the array contents in all processes are replaced by the
|
||||||
reduction result (in this case, the maximum value in each position across all the processes). So after the
|
reduction result (in this case, the maximum value in each position across all the processes). So, after the
|
||||||
Allreduce call, the result will become ```a = {1, 2, 3}```.
|
Allreduce call, the result will become ```a = {1, 2, 3}```.
|
||||||
Rabit provides different reduction operators, for example, you can change ```op::Max``` to ```op::Sum```,
|
Rabit provides different reduction operators, for example, if you change ```op::Max``` to ```op::Sum```,
|
||||||
then the reduction operation will become the summation, and the result will become ```a = {1, 3, 5}```.
|
the reduction operation will be a summation, and the result will become ```a = {1, 3, 5}```.
|
||||||
You can also run example with different processes by setting -n to different values, to see the outcomming result.
|
You can also run the example with different processes by setting -n to different values.
|
||||||
|
|
||||||
Broadcast is another method provided by rabit besides Allreduce, this function allows one node to broadcast its
|
Broadcast is another method provided by rabit besides Allreduce. This function allows one node to broadcast its
|
||||||
local data to all the other nodes. The following code in [broadcast.cc](broadcast.cc) broadcast a string from
|
local data to all other nodes. The following code in [broadcast.cc](broadcast.cc) broadcasts a string from
|
||||||
node 0 to all other nodes.
|
node 0 to all other nodes.
|
||||||
```c++
|
```c++
|
||||||
#include <rabit.h>
|
#include <rabit.h>
|
||||||
@ -78,30 +78,29 @@ int main(int argc, char *argv[]) {
|
|||||||
return 0;
|
return 0;
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
You can run the program by the following command, using three workers.
|
The following command starts the program with three worker processes.
|
||||||
```bash
|
```bash
|
||||||
../tracker/rabit_demo.py -n 3 broadcast.rabit
|
../tracker/rabit_demo.py -n 3 broadcast.rabit
|
||||||
```
|
```
|
||||||
Besides string, rabit also allows broadcast of constant size array and vector.
|
Besides strings, rabit also allows to broadcast constant size array and vectors.
|
||||||
|
|
||||||
Common Use Case
|
Common Use Case
|
||||||
=====
|
=====
|
||||||
Many distributed machine learning algorithm involves dividing the data into each node,
|
Many distributed machine learning algorithms involve splitting the data into different nodes,
|
||||||
compute statistics locally and aggregates them together. Such process is usually done repeatively in
|
computing statistics locally, and finally aggregating them. Such workflow is usually done repetitively through
|
||||||
many iterations before the algorithm converge. Allreduce naturally meets the need of such programs,
|
many iterations before the algorithm converges. Allreduce naturally meets the structure of such programs,
|
||||||
common use cases include:
|
common use cases include:
|
||||||
|
|
||||||
* Aggregation of gradient values, which can be used in optimization methods such as L-BFGS.
|
* Aggregation of gradient values, which can be used in optimization methods such as L-BFGS.
|
||||||
* Aggregation of other statistics, which can be used in KMeans and Gaussian Mixture Model.
|
* Aggregation of other statistics, which can be used in KMeans and Gaussian Mixture Models.
|
||||||
* Find the best split candidate and aggregation of split statistics, used for tree based models.
|
* Find the best split candidate and aggregation of split statistics, used for tree based models.
|
||||||
|
|
||||||
The main purpose of Rabit is to provide reliable and portable library for distributed machine learning programs.
|
Rabit is a reliable and portable library for distributed machine learning programs, that allow programs to run reliably on different platforms.
|
||||||
So that the program can be run reliably on different types of platforms.
|
|
||||||
|
|
||||||
Structure of Rabit Program
|
Structure of a Rabit Program
|
||||||
=====
|
=====
|
||||||
The following code illustrates the common structure of rabit program. This is an abstract example,
|
The following code illustrates the common structure of a rabit program. This is an abstract example,
|
||||||
you can also refer to [kmeans.cc](../toolkit/kmeans.cc) for an example implementation of kmeans.
|
you can also refer to [kmeans.cc](../toolkit/kmeans.cc) for an example implementation of kmeans algorithm.
|
||||||
|
|
||||||
```c++
|
```c++
|
||||||
#include <rabit.h>
|
#include <rabit.h>
|
||||||
@ -127,65 +126,65 @@ int main(int argc, char *argv[]) {
|
|||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
Besides the common Allreduce and Broadcast function, there are two additional functions: ```CheckPoint```
|
Besides the common Allreduce and Broadcast functions, there are two additional functions: ```LoadCheckPoint```
|
||||||
and ```CheckPoint```. These two functions are used for fault-tolerance purpose.
|
and ```CheckPoint```. These two functions are used for fault-tolerance purposes.
|
||||||
Common machine learning programs involves several iterations. In each iteration, we start from a model, do some calls
|
As mentioned before, traditional machine learning programs involve several iterations. In each iteration, we start with a model, make some calls
|
||||||
to Allreduce or Broadcasts and update the model to a new one. The calling sequence in each iteration does not need to be the same.
|
to Allreduce or Broadcast and update the model. The calling sequence in each iteration does not need to be the same.
|
||||||
|
|
||||||
* When the nodes start from beginning, LoadCheckPoint returns 0, and we can initialize the model.
|
* When the nodes start from the beginning (i.e. iteration 0), LoadCheckPoint returns 0, so we can initialize the model.
|
||||||
* ```CheckPoint``` saves the model after each iteration.
|
* ```CheckPoint``` saves the model after each iteration.
|
||||||
- Efficiency Note: the model is only kept in local memory and no save to disk is involved in Checkpoint
|
- Efficiency Note: the model is only kept in local memory and no save to disk is performed when calling Checkpoint
|
||||||
* When a node goes down and restarts, ```LoadCheckPoint``` will recover the latest saved model, and
|
* When a node goes down and restarts, ```LoadCheckPoint``` will recover the latest saved model, and
|
||||||
* When a node goes down, the rest of the node will block in the call of Allreduce/Broadcast and helps
|
* When a node goes down, the rest of the nodes will block in the call of Allreduce/Broadcast and wait for
|
||||||
the recovery of the failure nodes, util it catches up.
|
the recovery of the failed node until it catches up.
|
||||||
|
|
||||||
Please also see the section of [fault tolerance procedure](#fault-tolerance) in rabit to understand the recovery procedure under going in rabit
|
Please see the [fault tolerance procedure](#fault-tolerance) section to understand the recovery procedure executed by rabit.
|
||||||
|
|
||||||
Compile Programs with Rabit
|
Compile Programs with Rabit
|
||||||
====
|
====
|
||||||
Rabit is a portable library, to use it, you only need to include the rabit header file.
|
Rabit is a portable library, to use it, you only need to include the rabit header file.
|
||||||
* You will need to add path to [../include](../include) to the header search path of compiler
|
* You will need to add the path to [../include](../include) to the header search path of the compiler
|
||||||
- Solution 1: add ```-I/path/to/rabit/include``` to the compiler flag in gcc or clang
|
- Solution 1: add ```-I/path/to/rabit/include``` to the compiler flag in gcc or clang
|
||||||
- Solution 2: add the path to enviroment variable CPLUS_INCLUDE_PATH
|
- Solution 2: add the path to the environment variable CPLUS_INCLUDE_PATH
|
||||||
* You will need to add path to [../lib](../lib) to the library search path of compiler
|
* You will need to add the path to [../lib](../lib) to the library search path of the compiler
|
||||||
- Solution 1: add ```-L/path/to/rabit/lib``` to the linker flag
|
- Solution 1: add ```-L/path/to/rabit/lib``` to the linker flag
|
||||||
- Solution 2: add the path to enviroment variable LIBRARY_PATH AND LD_LIBRARY_PATH
|
- Solution 2: add the path to environment variable LIBRARY_PATH AND LD_LIBRARY_PATH
|
||||||
* Link against lib/rabit.a
|
* Link against lib/rabit.a
|
||||||
- Add ```-lrabit``` to linker flag
|
- Add ```-lrabit``` to the linker flag
|
||||||
|
|
||||||
The procedure above allows you to compile a program with rabit. The following two sections are additional
|
The procedure above allows you to compile a program with rabit. The following two sections contain additional
|
||||||
advanced options you can take to link against different backend other than the normal one.
|
options you can use to link against different backends other than the normal one.
|
||||||
|
|
||||||
#### Link against MPI Allreduce
|
#### Link against MPI Allreduce
|
||||||
You can link against ```rabit_mpi.a``` instead to use MPI Allreduce, however, the resulting program is backed by MPI and
|
You can link against ```rabit_mpi.a``` instead of using MPI Allreduce, however, the resulting program is backed by MPI and
|
||||||
is not fault tolerant anymore.
|
is not fault tolerant anymore.
|
||||||
* Simply change linker flag from ```-lrabit``` to ```-lrabit_mpi```
|
* Simply change the linker flag from ```-lrabit``` to ```-lrabit_mpi```
|
||||||
* The final linking needs to be done by mpi wrapper compiler ```mpicxx```
|
* The final linking needs to be done by mpi wrapper compiler ```mpicxx```
|
||||||
|
|
||||||
#### Link against Mock Test Rabit Library
|
#### Link against Mock Test Rabit Library
|
||||||
If you want to mock test the program to see the behavior of the code when some nodes goes down. You can link against ```rabit_mock.a``` .
|
If you want to use a mock to test the program in order to see the behavior of the code when some nodes go down, you can link against ```rabit_mock.a``` .
|
||||||
* Simply change linker flag from ```-lrabit``` to ```-lrabit_mock```
|
* Simply change the linker flag from ```-lrabit``` to ```-lrabit_mock```
|
||||||
|
|
||||||
The resulting rabit program can take in additional arguments in format of
|
The resulting rabit mock program can take in additional arguments in the following format
|
||||||
```
|
```
|
||||||
mock=rank,version,seq,ndeath
|
mock=rank,version,seq,ndeath
|
||||||
```
|
```
|
||||||
|
|
||||||
The four integers specifies an event that will cause the program to suicide(exit with -2)
|
The four integers specify an event that will cause the program to ```commit suicide```(exit with -2)
|
||||||
* rank specifies the rank of the node
|
* rank specifies the rank of the node to kill
|
||||||
* version specifies the current version(iteration) of the model
|
* version specifies the version (iteration) of the model where you want the process to die
|
||||||
* seq specifies the sequence number of Allreduce/Broadcast call since last checkpoint
|
* seq specifies the sequence number of the Allreduce/Broadcast call since last checkpoint, where the process will be killed
|
||||||
* ndeath specifies how many times this node died already
|
* ndeath specifies how many times this node died already
|
||||||
|
|
||||||
For example, consider the following script in the test case
|
For example, consider the following script in the test case
|
||||||
```bash
|
```bash
|
||||||
../tracker/rabit_demo.py -n 10 test_model_recover 10000\
|
../tracker/rabit_demo.py -n 10 test_model_recover 10000\
|
||||||
mock=0,0,1,0 mock=1,1,1,0 mock=1,1,1,1
|
mock=0,0,1,0 mock=1,1,1,0 mock=1,1,1,1
|
||||||
```
|
```
|
||||||
* The first mock will cause node 0 to exit when calling second Allreduce/Broadcast (seq = 1) in iteration 0
|
* The first mock will cause node 0 to exit when calling the second Allreduce/Broadcast (seq = 1) in iteration 0
|
||||||
* The second mock will cause node 1 to exit when calling second Allreduce/Broadcast (seq = 1) in iteration 1
|
* The second mock will cause node 1 to exit when calling the second Allreduce/Broadcast (seq = 1) in iteration 1
|
||||||
* The second mock will cause node 0 to exit again when calling second Allreduce/Broadcast (seq = 1) in iteration 1
|
* The third mock will cause node 1 to exit again when calling second Allreduce/Broadcast (seq = 1) in iteration 1
|
||||||
- Note that ndeath = 1 means this will happen only if node 0 died once, which is our case
|
- Note that ndeath = 1 means this will happen only if node 1 died once, which is our case
|
||||||
|
|
||||||
Running Rabit Jobs
|
Running Rabit Jobs
|
||||||
====
|
====
|
||||||
|
|||||||
Loading…
x
Reference in New Issue
Block a user