tqchen
0e012cb05e
add speed test
2014-12-06 11:05:24 -08:00
tqchen
19631ecef6
more tracker renaming
2014-12-06 09:24:12 -08:00
tqchen
dc12958fc7
rename master to tracker, to emphasize rabit is p2p in computing
2014-12-06 09:15:31 -08:00
tqchen
7765e2dc55
add status report
2014-12-05 09:49:26 -08:00
tqchen
ab278513ab
ok
2014-12-05 09:39:51 -08:00
tqchen
821eb21ae2
before make rabit public
2014-12-04 17:30:58 -08:00
tqchen
cc410b8c90
add local model in checkpoint interface, a new goal
2014-12-04 11:09:15 -08:00
nachocano
5c23b94069
updating kmeans based on Tianqi's feedback. More efficient now
2014-12-03 15:38:58 -08:00
nachocano
55c2a5dc83
Merge branch 'master' of https://github.com/tqchen/allreduce
2014-12-03 14:21:42 -08:00
nachocano
1d0d5bb141
kmeans seems to be working.. not restarting anything though
2014-12-03 14:21:10 -08:00
tqchen
7a983a4079
add keepalive
2014-12-03 13:21:30 -08:00
tqchen
2523288509
basic recovery works
2014-12-03 12:19:08 -08:00
tqchen
8a6768763d
bug fixed ver
2014-12-03 11:51:39 -08:00
tqchen
a186f8c3aa
ok
2014-12-03 11:19:43 -08:00
tqchen
ceeb6f0690
bug version, check in and rollback
2014-12-03 11:17:39 -08:00
tqchen
f3e5b6e13c
ok
2014-12-03 10:00:47 -08:00
tqchen
34f2f887b1
add more broadcast and basic broadcast
2014-12-03 09:59:13 -08:00
tqchen
ed1de6df80
change AllReduce to Allreduce
2014-12-02 21:11:48 -08:00
tqchen
0a3300d773
rabit run on MPI
2014-12-02 11:20:19 -08:00
tqchen
255218a2f3
change in interface, seems resetlink is still bad
2014-12-01 21:39:51 -08:00
tqchen
b76cd5858c
seems ok version
2014-12-01 20:18:25 -08:00
tqchen
46b5d46111
fix one bug, another comes
2014-12-01 19:53:41 -08:00
tqchen
993ff8bb91
find one bug, continue to next one
2014-12-01 19:34:27 -08:00
tqchen
2cde04867f
Merge branch 'master' of ssh://github.com/tqchen/rabit
2014-12-01 16:57:33 -08:00
tqchen
337840d29b
recover not yet working
2014-12-01 16:57:26 -08:00
Tianqi Chen
fd2c57b8a4
Update engine_robust.cc
2014-12-01 15:32:57 -08:00
tqchen
1c5167d96e
rabit seems ready to run
2014-12-01 10:32:30 -08:00
tqchen
eb2ca06d67
fresh name fresh start
2014-12-01 09:17:05 -08:00
tqchen
16f729115e
checkin allreduce recover
2014-11-30 22:41:04 -08:00
tqchen
9355f5faf2
more conservative exception watching
2014-11-30 21:39:22 -08:00
tqchen
8cef2086f5
smarter select for allreduce and bcast
2014-11-30 21:31:45 -08:00
tqchen
f7928c68a3
next round try more careful select design
2014-11-30 21:07:34 -08:00
tqchen
ecb09a23bc
add recover data, do a round of review
2014-11-30 20:59:55 -08:00
tqchen
b9b58a1275
bugfix in decide
2014-11-30 17:48:30 -08:00
tqchen
4a6c01c83c
minor change in decide
2014-11-30 17:48:02 -08:00
tqchen
27f6f8ea9e
bugfix in msg passing
2014-11-30 17:42:18 -08:00
tqchen
d8d648549f
finish message passing, do a review on msg passing and decide
2014-11-30 17:40:30 -08:00
tqchen
38cd595235
check in message passing
2014-11-30 16:38:47 -08:00
tqchen
7a60cb7f3e
checkin decide request, todo message passing
2014-11-30 16:37:26 -08:00
tqchen
68f13cd739
tight
2014-11-30 11:46:21 -08:00
tqchen
d1ce3c697c
inline
2014-11-30 11:45:50 -08:00
tqchen
2e536eda29
check in the recover strategy
2014-11-30 11:42:59 -08:00
tqchen
155ed3a814
seems an OK version of reset, start to work on decide exec
2014-11-29 22:22:51 -08:00
tqchen
5b0bb53184
refactor code style, reset link still needs thought
2014-11-29 20:15:27 -08:00
tqchen
42505f473d
finish reset link log
2014-11-29 15:14:43 -08:00
tqchen
98756c068a
livelock in oob send recv
2014-11-28 21:58:15 -08:00
tqchen
aa54a038f2
livelock in oob send recv
2014-11-28 21:56:58 -08:00
tqchen
a30075794b
initial version of robust engine, add discard link, need more random mock test, next milestone will be recovery
2014-11-28 15:56:12 -08:00
nachocano
a8128493c2
execute it like this: ./test.sh 4 4000 testcase0.conf ./
Now we are passing the folder where the round instances are saved.
The problem is that calling utils::Check or utils::Assert on 1 or 2 nodes shuts down all of them. Only those nodes should be shut down for this to work. There may be some other mechanism to shut down a particular node. Tianqi?
2014-11-28 01:48:26 -08:00
nachocano
faed8285cd
execute it like ./test.sh 4 4000 testcase0.conf to obtain a successful execution
updating mock. It now wraps the calls to sync and reads config from configuration file.
I believe it's better not to use the preprocessor directive, i.e. not to put any test code in the engine_tcp. I just call the mock in the test_allreduce file. It's a file purely for testing purposes, so it's fine to use the mock there.
2014-11-28 00:16:35 -08:00