update doc
guide/README.md
@@ -10,10 +10,12 @@ Please also refer to the [API Documentation](http://homes.cs.washington.edu/~tqc
* [What is Allreduce](#what-is-allreduce)
* [Common Use Case](#common-use-case)
* [Structure of Rabit Program](#structure-of-rabit-program)
* [Compile Programs with Rabit](#compile-programs-with-rabit)
* [Running Rabit Jobs](#running-rabit-jobs)
  - [Running Rabit on Hadoop](#running-rabit-on-hadoop)
  - [Running Rabit using MPI](#running-rabit-using-mpi)
  - [Customize Tracker Script](#customize-tracker-script)
* [Fault Tolerance](#fault-tolerance)

What is Allreduce
=====
@@ -137,7 +139,101 @@ to Allreduce or Broadcasts and update the model to a new one. The calling sequen
* When a node goes down, the rest of the nodes will block in the call of Allreduce/Broadcast and help
the recovery of the failed node until it catches up.

Please also see the section on the [fault tolerance procedure](#fault-tolerance) to understand the recovery procedure going on in rabit.

Compile Programs with Rabit
====
Rabit is a portable library; to use it, you only need to include the rabit header file and link against the rabit library. An example compile command is given after the list below.
* You will need to add the path to [../include](../include) to the header search path of the compiler
  - Solution 1: add ```-I/path/to/rabit/include``` to the compiler flags in gcc or clang
  - Solution 2: add the path to the environment variable CPLUS_INCLUDE_PATH
* You will need to add the path to [../lib](../lib) to the library search path of the compiler
  - Solution 1: add ```-L/path/to/rabit/lib``` to the linker flags
  - Solution 2: add the path to the environment variables LIBRARY_PATH and LD_LIBRARY_PATH
* Link against lib/rabit.a
  - Add ```-lrabit``` to the linker flags
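
Putting these flags together, a complete compile command might look like the following minimal sketch; ```basic.cc``` and ```/path/to/rabit``` are placeholders for your own program and rabit checkout:

```bash
# Minimal sketch: compile a rabit program against the static library.
# /path/to/rabit and basic.cc are placeholders for illustration.
g++ -O3 -I/path/to/rabit/include -o basic basic.cc -L/path/to/rabit/lib -lrabit
```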

The procedure above allows you to compile a program with rabit. The following two sections describe additional
advanced options for linking against a backend other than the default one.

#### Link against MPI Allreduce
You can link against ```rabit_mpi.a``` instead to use MPI Allreduce; however, the resulting program is backed by MPI and
is no longer fault tolerant.
* Simply change the linker flag from ```-lrabit``` to ```-lrabit_mpi```
* The final linking needs to be done by the MPI wrapper compiler ```mpicxx```, as in the sketch below
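
For example, under the same placeholder paths as above, the final link step might change to:

```bash
# Sketch: mpicxx supplies the MPI include and link flags; only the rabit library name changes.
mpicxx -O3 -I/path/to/rabit/include -o basic basic.cc -L/path/to/rabit/lib -lrabit_mpi
```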

#### Link against Mock Test Rabit Library
If you want to mock test the program to see the behavior of the code when some nodes go down, you can link against ```rabit_mock.a``` instead.
* Simply change the linker flag from ```-lrabit``` to ```-lrabit_mock```

The resulting rabit program can take additional arguments in the format
```
mock=rank,version,seq,ndeath
```

The four integers specify an event that will cause the program to commit suicide (exit with -2):
* rank specifies the rank of the node
* version specifies the current version (iteration) of the model
* seq specifies the sequence number of the Allreduce/Broadcast call since the last checkpoint
* ndeath specifies how many times this node has already died

For example, consider the following script in the test case
```bash
../tracker/rabit_demo.py -n 10 test_model_recover 10000\
    mock=0,0,1,0 mock=1,1,1,0 mock=1,1,1,1
```
* The first mock will cause node 0 to exit when calling the second Allreduce/Broadcast (seq = 1) in iteration 0
* The second mock will cause node 1 to exit when calling the second Allreduce/Broadcast (seq = 1) in iteration 1
* The third mock will cause node 1 to exit again when calling the second Allreduce/Broadcast (seq = 1) in iteration 1
  - Note that ndeath = 1 means this will happen only if node 1 has already died once, which is the case here

Running Rabit Jobs
====
Rabit is a portable library that can run on multiple platforms.

#### Running Rabit Locally
* You can use [../tracker/rabit_demo.py](../tracker/rabit_demo.py) to start n processes locally, as in the example below
* This script will restart the program when it exits with -2, so it can be used for [mock tests](#link-against-mock-test-rabit-library)
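
For instance, a local run with four worker processes might look like this; ```my_program``` and its arguments are hypothetical placeholders for your own rabit binary:

```bash
# Hypothetical program name and arguments; -n sets the number of local worker processes.
../tracker/rabit_demo.py -n 4 my_program arg1 arg2
```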

#### Running Rabit on Hadoop
* You can use [../tracker/rabit_hadoop.py](../tracker/rabit_hadoop.py) to run rabit programs on hadoop; a hypothetical invocation is sketched after this list
* This will start n rabit programs as mappers of MapReduce
* Each program can read its part of the data from stdin
* Yarn is highly recommended, since Yarn allows specifying the number of cpus and the memory of each mapper
  - This allows multi-threaded programs in each node, which can be more efficient
  - A good possible practice is an OpenMP-rabit hybrid code
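
The exact options of ```rabit_hadoop.py``` depend on the version of the script, so the following is only a hypothetical sketch; it assumes a ```-n``` worker-count option analogous to ```rabit_demo.py```, and ```my_program``` is a placeholder (check ```../tracker/rabit_hadoop.py --help``` for the real flags):

```bash
# Hypothetical invocation; the -n flag and program name are assumptions, see --help for actual options.
../tracker/rabit_hadoop.py -n 16 my_program arg1 arg2
```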

#### Running Rabit on Yarn
* To be modified from [../tracker/rabit_hadoop.py](../tracker/rabit_hadoop.py)

#### Running Rabit using MPI
* You can submit rabit programs to an MPI cluster using [../tracker/rabit_mpi.py](../tracker/rabit_mpi.py)
* If you linked your code against librabit_mpi.a, then you can directly use mpirun to submit the job, as in the example below
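
For a program linked against ```librabit_mpi.a```, direct submission through mpirun might look like this; the process count and binary name are placeholders:

```bash
# Placeholder process count and binary; mpirun launches the MPI-backed rabit program directly.
mpirun -n 8 ./my_program arg1 arg2
```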

#### Customize Tracker Script
You can also modify the tracker script to allow rabit to run on other platforms. To do so, refer to existing
tracker scripts such as [../tracker/rabit_hadoop.py](../tracker/rabit_hadoop.py) and [../tracker/rabit_mpi.py](../tracker/rabit_mpi.py)

You will need to implement a platform-dependent submission function with the following definition
```python
def fun_submit(nslave, slave_args):
    """
    Customized submit function that submits nslave jobs;
    each job must take slave_args as additional command-line arguments.
    Note this can be a lambda closure.

    Parameters
      nslave     number of slave processes to start up
      slave_args tracker information which must be passed to the arguments of each job,
                 this usually includes the parameters of master_uri and port, etc.
    """
```

The submission function should start nslave processes on the platform and append slave_args to the end of the other program arguments.
Then we can simply call ```tracker.submit``` with fun_submit to submit jobs to the target platform.

Note that the current rabit tracker does not restart a worker when it dies; the job of fail-restart thus lies with the platform itself, or we
should write fail-restart logic into the customized script.
* Fail-restart is usually provided by most platforms.
* For example, MapReduce will restart a mapper when it fails

Fault Tolerance
=====
@@ -166,15 +262,3 @@ touching the disk. This makes rabit program more reliable and efficient.
This is a conceptual introduction to the fault tolerant model of rabit. The actual implementation is more sophisticated,
and can deal with more complicated cases such as multiple node failures and node failure during the recovery phase.