556 Commits

Author SHA1 Message Date
tqchen
2523288509 basic recovery works 2014-12-03 12:19:08 -08:00
tqchen
8a6768763d bug fixed ver 2014-12-03 11:51:39 -08:00
tqchen
a186f8c3aa ok 2014-12-03 11:19:43 -08:00
tqchen
ceeb6f0690 bug version, check in and rollback 2014-12-03 11:17:39 -08:00
tqchen
f3e5b6e13c ok 2014-12-03 10:00:47 -08:00
tqchen
34f2f887b1 add more broadcast and basic broadcast 2014-12-03 09:59:13 -08:00
nachocano
20b51cc9ce cleaner 2014-12-03 01:44:34 -08:00
nachocano
56aad86231 adding incomplete kmeans.
I'm having a problem with the broadcast, and still need to implement the logic
2014-12-03 01:16:13 -08:00
tqchen
ed1de6df80 change AllReduce to Allreduce 2014-12-02 21:11:48 -08:00
nachocano
8cb5b68cb6 Merge branch 'master' of https://github.com/tqchen/allreduce 2014-12-02 11:28:27 -08:00
nachocano
e4abca9494 changing report folder to doc 2014-12-02 11:28:20 -08:00
tqchen
0a3300d773 rabit run on MPI 2014-12-02 11:20:19 -08:00
nachocano
2fab05c83e adding some design goals. 2014-12-02 11:07:07 -08:00
nachocano
40f7ee1cab adding simple image 2014-12-02 01:49:54 -08:00
nachocano
2c166d7a3a adding some initial skeleton of the report. 2014-12-02 01:19:36 -08:00
tqchen
dcea64c838 check in model recover 2014-12-01 21:41:37 -08:00
tqchen
255218a2f3 change in interface, seems resetlink is still bad 2014-12-01 21:39:51 -08:00
tqchen
b76cd5858c seems ok version 2014-12-01 20:18:25 -08:00
tqchen
46b5d46111 fix one bug, another comes 2014-12-01 19:53:41 -08:00
tqchen
993ff8bb91 find one bug, continue to next one 2014-12-01 19:34:27 -08:00
tqchen
2cde04867f Merge branch 'master' of ssh://github.com/tqchen/rabit 2014-12-01 16:57:33 -08:00
tqchen
337840d29b recover not yet working 2014-12-01 16:57:26 -08:00
Tianqi Chen
fd2c57b8a4 Update engine_robust.cc 2014-12-01 15:32:57 -08:00
tqchen
1c5167d96e rabit seems ready to run 2014-12-01 10:32:30 -08:00
Tianqi Chen
0d63646015 Update README.md 2014-12-01 10:04:10 -08:00
Tianqi Chen
b5367f48f6 Update README.md 2014-12-01 10:03:45 -08:00
Tianqi Chen
62c8ce9657 Update README.md 2014-12-01 10:03:31 -08:00
tqchen
eb2ca06d67 fresh name fresh start 2014-12-01 09:17:05 -08:00
tqchen
16f729115e checkin allreduce recover 2014-11-30 22:41:04 -08:00
tqchen
9355f5faf2 more conservative exception watching 2014-11-30 21:39:22 -08:00
tqchen
8cef2086f5 smarter select for allreduce and bcast 2014-11-30 21:31:45 -08:00
tqchen
f7928c68a3 next round try more careful select design 2014-11-30 21:07:34 -08:00
tqchen
ecb09a23bc add recover data, do a round of review 2014-11-30 20:59:55 -08:00
tqchen
b9b58a1275 bugfix in decide 2014-11-30 17:48:30 -08:00
tqchen
4a6c01c83c minor change in decide 2014-11-30 17:48:02 -08:00
tqchen
27f6f8ea9e bugfix in msg passing 2014-11-30 17:42:18 -08:00
tqchen
d8d648549f finish message passing, do a review on msg passing and decide 2014-11-30 17:40:30 -08:00
tqchen
38cd595235 check in message passing 2014-11-30 16:38:47 -08:00
tqchen
7a60cb7f3e checkin decide request, todo message passing 2014-11-30 16:37:26 -08:00
tqchen
68f13cd739 tight 2014-11-30 11:46:21 -08:00
tqchen
d1ce3c697c inline 2014-11-30 11:45:50 -08:00
tqchen
2e536eda29 check in the recover strategy 2014-11-30 11:42:59 -08:00
tqchen
155ed3a814 seems an OK version of reset, start to work on decide exec 2014-11-29 22:22:51 -08:00
tqchen
5b0bb53184 refactor code style, reset link still need thoughts 2014-11-29 20:15:27 -08:00
tqchen
42505f473d finish reset link log 2014-11-29 15:14:43 -08:00
tqchen
98756c068a livelock in oob send recv 2014-11-28 21:58:15 -08:00
tqchen
aa54a038f2 livelock in oob send recv 2014-11-28 21:56:58 -08:00
tqchen
a30075794b initial version of robust engine, add discard link, need more random mock test, next milestone will be recovery 2014-11-28 15:56:12 -08:00
nachocano
a8128493c2 execute it like this: ./test.sh 4 4000 testcase0.conf ./
Now we are passing the folder where the round instances are saved.
The problem is that calling utils::Check or utils::Assert on 1 or 2 nodes shuts down all of them. Only those nodes should be shut down, and then this will work. There may be some other mechanism to shut down a particular node. Tianqi?
2014-11-28 01:48:26 -08:00
nachocano
faed8285cd execute it like ./test.sh 4 4000 testcase0.conf to obtain a successful execution
updating mock. It now wraps the calls to sync and reads config from configuration file.
I believe it's better not to use the preprocessor directive, i.e. not to put any test code in the engine_tcp. I just call the mock in the test_allreduce file. It's a file purely for testing purposes, so it's fine to use the mock there.
2014-11-28 00:16:35 -08:00