tqchen
ed1de6df80
change AllReduce to Allreduce
2014-12-02 21:11:48 -08:00
nachocano
8cb5b68cb6
Merge branch 'master' of https://github.com/tqchen/allreduce
2014-12-02 11:28:27 -08:00
nachocano
e4abca9494
changing report folder to doc
2014-12-02 11:28:20 -08:00
tqchen
0a3300d773
rabit run on MPI
2014-12-02 11:20:19 -08:00
nachocano
2fab05c83e
adding some design goals.
2014-12-02 11:07:07 -08:00
nachocano
40f7ee1cab
adding simple image
2014-12-02 01:49:54 -08:00
nachocano
2c166d7a3a
adding some initial skeleton of the report.
2014-12-02 01:19:36 -08:00
tqchen
dcea64c838
check in model recover
2014-12-01 21:41:37 -08:00
tqchen
255218a2f3
change in interface, seems resetlink is still bad
2014-12-01 21:39:51 -08:00
tqchen
b76cd5858c
seems ok version
2014-12-01 20:18:25 -08:00
tqchen
46b5d46111
fix one bug, another comes
2014-12-01 19:53:41 -08:00
tqchen
993ff8bb91
find one bug, continue to next one
2014-12-01 19:34:27 -08:00
tqchen
2cde04867f
Merge branch 'master' of ssh://github.com/tqchen/rabit
2014-12-01 16:57:33 -08:00
tqchen
337840d29b
recover not yet working
2014-12-01 16:57:26 -08:00
Tianqi Chen
fd2c57b8a4
Update engine_robust.cc
2014-12-01 15:32:57 -08:00
tqchen
1c5167d96e
rabit seems ready to run
2014-12-01 10:32:30 -08:00
Tianqi Chen
0d63646015
Update README.md
2014-12-01 10:04:10 -08:00
Tianqi Chen
b5367f48f6
Update README.md
2014-12-01 10:03:45 -08:00
Tianqi Chen
62c8ce9657
Update README.md
2014-12-01 10:03:31 -08:00
tqchen
eb2ca06d67
fresh name fresh start
2014-12-01 09:17:05 -08:00
tqchen
16f729115e
checkin allreduce recover
2014-11-30 22:41:04 -08:00
tqchen
9355f5faf2
more conservative exception watching
2014-11-30 21:39:22 -08:00
tqchen
8cef2086f5
smarter select for allreduce and bcast
2014-11-30 21:31:45 -08:00
tqchen
f7928c68a3
next round try more careful select design
2014-11-30 21:07:34 -08:00
tqchen
ecb09a23bc
add recover data, do a round of review
2014-11-30 20:59:55 -08:00
tqchen
b9b58a1275
bugfix in decide
2014-11-30 17:48:30 -08:00
tqchen
4a6c01c83c
minor change in decide
2014-11-30 17:48:02 -08:00
tqchen
27f6f8ea9e
bugfix in msg passing
2014-11-30 17:42:18 -08:00
tqchen
d8d648549f
finish message passing, do a review on msg passing and decide
2014-11-30 17:40:30 -08:00
tqchen
38cd595235
check in message passing
2014-11-30 16:38:47 -08:00
tqchen
7a60cb7f3e
checkin decide request, todo message passing
2014-11-30 16:37:26 -08:00
tqchen
68f13cd739
tight
2014-11-30 11:46:21 -08:00
tqchen
d1ce3c697c
inline
2014-11-30 11:45:50 -08:00
tqchen
2e536eda29
check in the recover strategy
2014-11-30 11:42:59 -08:00
tqchen
155ed3a814
seems an OK version of reset, start to work on decide exec
2014-11-29 22:22:51 -08:00
tqchen
5b0bb53184
refactor code style, reset link still needs thought
2014-11-29 20:15:27 -08:00
tqchen
42505f473d
finish reset link log
2014-11-29 15:14:43 -08:00
tqchen
98756c068a
livelock in oob send recv
2014-11-28 21:58:15 -08:00
tqchen
aa54a038f2
livelock in oob send recv
2014-11-28 21:56:58 -08:00
tqchen
a30075794b
initial version of robust engine, add discard link, need more random mock test, next milestone will be recovery
2014-11-28 15:56:12 -08:00
nachocano
a8128493c2
execute it like this: ./test.sh 4 4000 testcase0.conf ./
Now we are passing the folder where the round instances are saved.
The problem is that calling utils::Check or utils::Assert on 1 or 2 nodes shuts down all of them. Only those nodes should be shut down; then this will work. There may be some other mechanism to shut down a particular node. Tianqi?
2014-11-28 01:48:26 -08:00
nachocano
faed8285cd
execute it like ./test.sh 4 4000 testcase0.conf to obtain a successful execution
updating mock. It now wraps the calls to sync and reads config from configuration file.
I believe it's better not to use the preprocessor directive, i.e. not to put any test code in the engine_tcp. I just call the mock in the test_allreduce file. It's a file purely for testing purposes, so it's fine to use the mock there.
2014-11-28 00:16:35 -08:00
nachocano
21f3f3eec4
adding const to variable to comply with google code convention...
may need to change more stuff though. Tianqi, what else do you mean? Spaces, tabs, names?
2014-11-27 17:03:31 -08:00
tqchen
2f1ba40786
change in socket, to pass out error code
2014-11-27 16:17:07 -08:00
nachocano
c565104491
adding some references to mock inside TEST preprocessor directive.
It shouldn't be an assert, because that shuts down the process. Instead, it should check the value and return some sort of error so that we can recover.
The mock contains queues, indexed by the rank of the process. For each node, you can configure the behavior you expect (success or failure for now) when you call any of the methods (AllReduce, Broadcast, LoadCheckPoint and CheckPoint). If you call AllReduce several times, the outputs will pop from the queue, i.e., first you can retrieve a success, then a failure, and so on.
Pretty basic for now, needs better tuning.
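A minimal sketch of the queue-per-rank idea described in this message, not the code from this commit. The names MockEngine, Method, Expect and Next are hypothetical, and returning success when nothing is queued is an assumption.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <utility>

// Hypothetical names throughout; this only illustrates the design in the
// commit message and is not the project's actual mock.
enum class Method { kAllReduce, kBroadcast, kLoadCheckPoint, kCheckPoint };

class MockEngine {
 public:
  // Script the outcome (true = success, false = failure) of a future call
  // to `method` on the node with the given rank.
  void Expect(int rank, Method method, bool success) {
    assert(rank >= 0);  // ranks are assumed to be non-negative
    outcomes_[std::make_pair(rank, method)].push_back(success);
  }

  // Pop the next scripted outcome in FIFO order. Assumption: a call with
  // nothing queued for this rank and method counts as a success.
  bool Next(int rank, Method method) {
    auto it = outcomes_.find(std::make_pair(rank, method));
    if (it == outcomes_.end() || it->second.empty()) return true;
    bool result = it->second.front();
    it->second.pop_front();
    return result;
  }

 private:
  // One queue of outcomes per (rank, method) pair.
  std::map<std::pair<int, Method>, std::deque<bool>> outcomes_;
};
```

With this sketch, calling Expect(1, Method::kAllReduce, true) and then Expect(1, Method::kAllReduce, false) makes the first Next(1, Method::kAllReduce) report success and the second report failure. The caller can then return an error instead of asserting.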
2014-11-26 17:24:29 -08:00
nachocano
54fcff189f
dummy mock for now
2014-11-26 16:37:23 -08:00
tqchen
d37f38c455
initial version of allreduce
2014-11-25 16:15:56 -08:00
Tianqi Chen
5e5bdda491
Initial commit
2014-11-25 14:37:18 -08:00