Squashed 'subtree/rabit/' changes from d4ec037..28ca7be

28ca7be add linear readme
ca4b20f add linear readme
1133628 add linear readme
6a11676 update docs
a607047 Update build.sh
2c1cfd8 complete yarn
4f28e32 change formater
2fbda81 fix stdin input
3258bcf checkin yarn master
67ebf81 allow setup from env variables
9b6bf57 fix hdfs
395d5c2 add make system
88ce767 refactor io, initial hdfs file access need test
19be870 chgs
a1bd3c6 Merge branch 'master' of ssh://github.com/tqchen/rabit
1a573f9 introduce input split
29476f1 fix timer issue

git-subtree-dir: subtree/rabit
git-subtree-split: 28ca7becbd
This commit is contained in:
tqchen
2015-03-09 13:28:38 -07:00
parent ef2de29f06
commit 57b5d7873f
43 changed files with 1797 additions and 235 deletions


@@ -341,12 +341,11 @@ Rabit is a portable library that can run on multiple platforms.
* This script will restart the program when it exits with -2, so it can be used for [mock test](#link-against-mock-test-library)
#### Running Rabit on Hadoop
* You can use [../tracker/rabit_hadoop.py](../tracker/rabit_hadoop.py) to run rabit programs on Hadoop
* This will start n rabit programs as mappers of MapReduce
* Each program can read its portion of data from stdin
* Yarn (Hadoop 2.0 or higher) is highly recommended, since Yarn allows specifying the number of CPUs and the amount of memory for each mapper:
* You can use [../tracker/rabit_yarn.py](../tracker/rabit_yarn.py) to run rabit programs as Yarn application
* This will start rabit programs as yarn applications
- This allows multi-threaded programs on each node, which can be more efficient
- An easy way to add multi-threading is to use OpenMP in the rabit code
* It is also possible to run rabit programs via Hadoop streaming; however, YARN is highly recommended.
#### Running Rabit using MPI
* You can submit rabit programs to an MPI cluster using [../tracker/rabit_mpi.py](../tracker/rabit_mpi.py).
@@ -358,15 +357,15 @@ tracker scripts, such as [../tracker/rabit_hadoop.py](../tracker/rabit_hadoop.py
You will need to implement a platform dependent submission function with the following definition
```python
def fun_submit(nworkers, worker_args, worker_envs):
    """
    Customized submit function that submits nworkers jobs;
    note this can be a lambda closure.

    Parameters
      nworkers    number of worker processes to start
      worker_args additional arguments that need to be passed to each worker,
                  usually including the tracker information such as master_uri and port
      worker_envs environment variables that need to be set for each worker
    """
"""
```
The submission function should start nworkers processes on the platform and append worker_args to the end of the other arguments.
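As an illustration, a minimal `fun_submit` could simply launch the workers as local subprocesses. This is only a sketch: the `make_local_submit` factory, its `worker_cmd` parameter, and the returned exit codes are assumptions for this example, not part of the rabit tracker API.

```python
import os
import subprocess
import sys

def make_local_submit(worker_cmd):
    """Build a fun_submit that runs workers as local subprocesses."""
    def fun_submit(nworkers, worker_args, worker_envs):
        procs = []
        for _ in range(nworkers):
            env = os.environ.copy()
            # set tracker-provided variables such as the master URI and port
            env.update({k: str(v) for k, v in worker_envs.items()})
            # append worker_args after the program's own arguments
            procs.append(subprocess.Popen(worker_cmd + worker_args, env=env))
        # wait for all workers; returning exit codes is purely for illustration
        return [p.wait() for p in procs]
    return fun_submit

# example: start two dummy workers that exit immediately
submit = make_local_submit([sys.executable, "-c", "import sys; sys.exit(0)"])
exit_codes = submit(2, [], {"RABIT_TRACKER_URI": "127.0.0.1"})
```

A real implementation would instead hand the command off to the cluster scheduler (Hadoop, Yarn, or MPI), as the bundled tracker scripts do.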
@@ -374,7 +373,7 @@ Then you can simply call ```tracker.submit``` with fun_submit to submit jobs to
Note that the current rabit tracker does not restart a worker when it dies; restarting a node is handled by the platform, so otherwise the fail-restart logic must be written into the custom script.
* Fail-restart is usually provided by most platforms.
* For example, MapReduce will restart a mapper when it fails
- rabit-yarn provides such functionality in YARN
Fault Tolerance
=====