cosmetic changes to tutorial

This commit is contained in:
nachocano 2015-01-11 01:07:51 -08:00
parent 7eb4258951
commit aea4c10847
2 changed files with 61 additions and 62 deletions

View File

@ -2,7 +2,7 @@
rabit is a light weight library that provides a fault tolerant interface of Allreduce and Broadcast. It is designed to support easy implementations of distributed machine learning programs, many of which fall naturally under the Allreduce abstraction. rabit is a light weight library that provides a fault tolerant interface of Allreduce and Broadcast. It is designed to support easy implementations of distributed machine learning programs, many of which fall naturally under the Allreduce abstraction.
* [Tutorial of Rabit](guide) * [Tutorial](guide)
* [API Documentation](http://homes.cs.washington.edu/~tqchen/rabit/doc) * [API Documentation](http://homes.cs.washington.edu/~tqchen/rabit/doc)
* You can also directly read the [interface header](include/rabit.h) * You can also directly read the [interface header](include/rabit.h)

View File

@ -1,15 +1,15 @@
Tutorial of Rabit Tutorial
===== =====
This is an tutorial of rabit, a ***Reliable Allreduce and Broadcast interface***. This is rabit's tutorial, a ***Reliable Allreduce and Broadcast Interface***.
To run the examples locally, you will need to type ```make``` to build all the examples. To run the examples locally, you will need to build them with ```make```.
Please also refer to the [API Documentation](http://homes.cs.washington.edu/~tqchen/rabit/doc) Please also refer to the [API Documentation](http://homes.cs.washington.edu/~tqchen/rabit/doc) for further details.
**List of Topics** **List of Topics**
* [What is Allreduce](#what-is-allreduce) * [What is Allreduce](#what-is-allreduce)
* [Common Use Case](#common-use-case) * [Common Use Case](#common-use-case)
* [Structure of Rabit Program](#structure-of-rabit-program) * [Structure of a Rabit Program](#structure-of-rabit-program)
* [Compile Programs with Rabit](#compile-programs-with-rabit) * [Compile Programs with Rabit](#compile-programs-with-rabit)
* [Running Rabit Jobs](#running-rabit-jobs) * [Running Rabit Jobs](#running-rabit-jobs)
- [Running Rabit on Hadoop](#running-rabit-on-hadoop) - [Running Rabit on Hadoop](#running-rabit-on-hadoop)
@ -19,8 +19,8 @@ Please also refer to the [API Documentation](http://homes.cs.washington.edu/~tqc
What is Allreduce What is Allreduce
===== =====
The main method provided by rabit are Allreduce and Broadcast. Allreduce performs reduction across different computation nodes, The main methods provided by rabit are Allreduce and Broadcast. Allreduce performs reduction across different computation nodes,
and returning the results to all the nodes. To understand the behavior of the function. Consider the following example in [basic.cc](basic.cc). and returns the result to every node. To understand the behavior of the function, consider the following example in [basic.cc](basic.cc).
```c++ ```c++
#include <rabit.h> #include <rabit.h>
using namespace rabit; using namespace rabit;
@ -41,24 +41,24 @@ int main(int argc, char *argv[]) {
return 0; return 0;
} }
``` ```
You can run the example using the rabit_demo.py script. The following commmand You can run the example using the rabit_demo.py script. The following command
start rabit program with two worker processes. starts the rabit program with two worker processes.
```bash ```bash
../tracker/rabit_demo.py -n 2 basic.rabit ../tracker/rabit_demo.py -n 2 basic.rabit
``` ```
This will start two process, one process with rank 0 and another rank 1, running the same code. This will start two processes, one process with rank 0 and the other with rank 1, both processes run the same code.
The ```rabit::GetRank()``` function return the rank of current process. The ```rabit::GetRank()``` function returns the rank of current process.
Before the call the allreduce, process 0 contains array ```a = {0, 1, 2}```, while process 1 have array Before the call to Allreduce, process 0 contains the array ```a = {0, 1, 2}```, while process 1 has the array
```a = {1, 2, 3}```. After the call of Allreduce, the array contents in all processes are replaced by the ```a = {1, 2, 3}```. After the call to Allreduce, the array contents in all processes are replaced by the
reduction result (in this case, the maximum value in each position across all the processes). So after the reduction result (in this case, the maximum value in each position across all the processes). So, after the
Allreduce call, the result will become ```a = {1, 2, 3}```. Allreduce call, the result will become ```a = {1, 2, 3}```.
Rabit provides different reduction operators, for example, you can change ```op::Max``` to ```op::Sum```, Rabit provides different reduction operators, for example, if you change ```op::Max``` to ```op::Sum```,
then the reduction operation will become the summation, and the result will become ```a = {1, 3, 5}```. the reduction operation will be a summation, and the result will become ```a = {1, 3, 5}```.
You can also run example with different processes by setting -n to different values, to see the outcomming result. You can also run the example with different processes by setting -n to different values.
Broadcast is another method provided by rabit besides Allreduce, this function allows one node to broadcast its Broadcast is another method provided by rabit besides Allreduce. This function allows one node to broadcast its
local data to all the other nodes. The following code in [broadcast.cc](broadcast.cc) broadcast a string from local data to all other nodes. The following code in [broadcast.cc](broadcast.cc) broadcasts a string from
node 0 to all other nodes. node 0 to all other nodes.
```c++ ```c++
#include <rabit.h> #include <rabit.h>
@ -78,30 +78,29 @@ int main(int argc, char *argv[]) {
return 0; return 0;
} }
``` ```
You can run the program by the following command, using three workers. The following command starts the program with three worker processes.
```bash ```bash
../tracker/rabit_demo.py -n 3 broadcast.rabit ../tracker/rabit_demo.py -n 3 broadcast.rabit
``` ```
Besides string, rabit also allows broadcast of constant size array and vector. Besides strings, rabit also allows to broadcast constant size array and vectors.
Common Use Case Common Use Case
===== =====
Many distributed machine learning algorithm involves dividing the data into each node, Many distributed machine learning algorithms involve splitting the data into different nodes,
compute statistics locally and aggregates them together. Such process is usually done repeatively in computing statistics locally, and finally aggregating them. Such workflow is usually done repetitively through
many iterations before the algorithm converge. Allreduce naturally meets the need of such programs, many iterations before the algorithm converges. Allreduce naturally meets the structure of such programs,
common use cases include: common use cases include:
* Aggregation of gradient values, which can be used in optimization methods such as L-BFGS. * Aggregation of gradient values, which can be used in optimization methods such as L-BFGS.
* Aggregation of other statistics, which can be used in KMeans and Gaussian Mixture Model. * Aggregation of other statistics, which can be used in KMeans and Gaussian Mixture Models.
* Find the best split candidate and aggregation of split statistics, used for tree based models. * Find the best split candidate and aggregation of split statistics, used for tree based models.
The main purpose of Rabit is to provide reliable and portable library for distributed machine learning programs. Rabit is a reliable and portable library for distributed machine learning programs, that allow programs to run reliably on different platforms.
So that the program can be run reliably on different types of platforms.
Structure of Rabit Program Structure of a Rabit Program
===== =====
The following code illustrates the common structure of rabit program. This is an abstract example, The following code illustrates the common structure of a rabit program. This is an abstract example,
you can also refer to [kmeans.cc](../toolkit/kmeans.cc) for an example implementation of kmeans. you can also refer to [kmeans.cc](../toolkit/kmeans.cc) for an example implementation of kmeans algorithm.
```c++ ```c++
#include <rabit.h> #include <rabit.h>
@ -127,65 +126,65 @@ int main(int argc, char *argv[]) {
} }
``` ```
Besides the common Allreduce and Broadcast function, there are two additional functions: ```CheckPoint``` Besides the common Allreduce and Broadcast functions, there are two additional functions: ```LoadCheckPoint```
and ```CheckPoint```. These two functions are used for fault-tolerance purpose. and ```CheckPoint```. These two functions are used for fault-tolerance purposes.
Common machine learning programs involves several iterations. In each iteration, we start from a model, do some calls As mentioned before, traditional machine learning programs involve several iterations. In each iteration, we start with a model, make some calls
to Allreduce or Broadcasts and update the model to a new one. The calling sequence in each iteration does not need to be the same. to Allreduce or Broadcast and update the model. The calling sequence in each iteration does not need to be the same.
* When the nodes start from beginning, LoadCheckPoint returns 0, and we can initialize the model. * When the nodes start from the beginning (i.e. iteration 0), LoadCheckPoint returns 0, so we can initialize the model.
* ```CheckPoint``` saves the model after each iteration. * ```CheckPoint``` saves the model after each iteration.
- Efficiency Note: the model is only kept in local memory and no save to disk is involved in Checkpoint - Efficiency Note: the model is only kept in local memory and no save to disk is performed when calling Checkpoint
* When a node goes down and restarts, ```LoadCheckPoint``` will recover the latest saved model, and * When a node goes down and restarts, ```LoadCheckPoint``` will recover the latest saved model, and
* When a node goes down, the rest of the node will block in the call of Allreduce/Broadcast and helps * When a node goes down, the rest of the nodes will block in the call of Allreduce/Broadcast and wait for
the recovery of the failure nodes, util it catches up. the recovery of the failed node until it catches up.
Please also see the section of [fault tolerance procedure](#fault-tolerance) in rabit to understand the recovery procedure under going in rabit Please see the [fault tolerance procedure](#fault-tolerance) section to understand the recovery procedure executed by rabit.
Compile Programs with Rabit Compile Programs with Rabit
==== ====
Rabit is a portable library, to use it, you only need to include the rabit header file. Rabit is a portable library, to use it, you only need to include the rabit header file.
* You will need to add path to [../include](../include) to the header search path of compiler * You will need to add the path to [../include](../include) to the header search path of the compiler
- Solution 1: add ```-I/path/to/rabit/include``` to the compiler flag in gcc or clang - Solution 1: add ```-I/path/to/rabit/include``` to the compiler flag in gcc or clang
- Solution 2: add the path to enviroment variable CPLUS_INCLUDE_PATH - Solution 2: add the path to the environment variable CPLUS_INCLUDE_PATH
* You will need to add path to [../lib](../lib) to the library search path of compiler * You will need to add the path to [../lib](../lib) to the library search path of the compiler
- Solution 1: add ```-L/path/to/rabit/lib``` to the linker flag - Solution 1: add ```-L/path/to/rabit/lib``` to the linker flag
- Solution 2: add the path to enviroment variable LIBRARY_PATH AND LD_LIBRARY_PATH - Solution 2: add the path to environment variable LIBRARY_PATH AND LD_LIBRARY_PATH
* Link against lib/rabit.a * Link against lib/rabit.a
- Add ```-lrabit``` to linker flag - Add ```-lrabit``` to the linker flag
The procedure above allows you to compile a program with rabit. The following two sections are additional The procedure above allows you to compile a program with rabit. The following two sections contain additional
advanced options you can take to link against different backend other than the normal one. options you can use to link against different backends other than the normal one.
#### Link against MPI Allreduce #### Link against MPI Allreduce
You can link against ```rabit_mpi.a``` instead to use MPI Allreduce, however, the resulting program is backed by MPI and You can link against ```rabit_mpi.a``` instead of using MPI Allreduce, however, the resulting program is backed by MPI and
is not fault tolerant anymore. is not fault tolerant anymore.
* Simply change linker flag from ```-lrabit``` to ```-lrabit_mpi``` * Simply change the linker flag from ```-lrabit``` to ```-lrabit_mpi```
* The final linking needs to be done by mpi wrapper compiler ```mpicxx``` * The final linking needs to be done by mpi wrapper compiler ```mpicxx```
#### Link against Mock Test Rabit Library #### Link against Mock Test Rabit Library
If you want to mock test the program to see the behavior of the code when some nodes goes down. You can link against ```rabit_mock.a``` . If you want to use a mock to test the program in order to see the behavior of the code when some nodes go down, you can link against ```rabit_mock.a``` .
* Simply change linker flag from ```-lrabit``` to ```-lrabit_mock``` * Simply change the linker flag from ```-lrabit``` to ```-lrabit_mock```
The resulting rabit program can take in additional arguments in format of The resulting rabit mock program can take in additional arguments in the following format
``` ```
mock=rank,version,seq,ndeath mock=rank,version,seq,ndeath
``` ```
The four integers specifies an event that will cause the program to suicide(exit with -2) The four integers specify an event that will cause the program to ```commit suicide```(exit with -2)
* rank specifies the rank of the node * rank specifies the rank of the node to kill
* version specifies the current version(iteration) of the model * version specifies the version (iteration) of the model where you want the process to die
* seq specifies the sequence number of Allreduce/Broadcast call since last checkpoint * seq specifies the sequence number of the Allreduce/Broadcast call since last checkpoint, where the process will be killed
* ndeath specifies how many times this node died already * ndeath specifies how many times this node died already
For example, consider the following script in the test case For example, consider the following script in the test case
```bash ```bash
../tracker/rabit_demo.py -n 10 test_model_recover 10000\ ../tracker/rabit_demo.py -n 10 test_model_recover 10000\
mock=0,0,1,0 mock=1,1,1,0 mock=1,1,1,1 mock=0,0,1,0 mock=1,1,1,0 mock=1,1,1,1
``` ```
* The first mock will cause node 0 to exit when calling second Allreduce/Broadcast (seq = 1) in iteration 0 * The first mock will cause node 0 to exit when calling the second Allreduce/Broadcast (seq = 1) in iteration 0
* The second mock will cause node 1 to exit when calling second Allreduce/Broadcast (seq = 1) in iteration 1 * The second mock will cause node 1 to exit when calling the second Allreduce/Broadcast (seq = 1) in iteration 1
* The second mock will cause node 0 to exit again when calling second Allreduce/Broadcast (seq = 1) in iteration 1 * The third mock will cause node 1 to exit again when calling second Allreduce/Broadcast (seq = 1) in iteration 1
- Note that ndeath = 1 means this will happen only if node 0 died once, which is our case - Note that ndeath = 1 means this will happen only if node 1 died once, which is our case
Running Rabit Jobs Running Rabit Jobs
==== ====