tqchen
337840d29b
recover not yet working
2014-12-01 16:57:26 -08:00
Tianqi Chen
fd2c57b8a4
Update engine_robust.cc
2014-12-01 15:32:57 -08:00
tqchen
1c5167d96e
rabit seems ready to run
2014-12-01 10:32:30 -08:00
Tianqi Chen
0d63646015
Update README.md
2014-12-01 10:04:10 -08:00
Tianqi Chen
b5367f48f6
Update README.md
2014-12-01 10:03:45 -08:00
Tianqi Chen
62c8ce9657
Update README.md
2014-12-01 10:03:31 -08:00
tqchen
eb2ca06d67
fresh name fresh start
2014-12-01 09:17:05 -08:00
tqchen
16f729115e
checkin allreduce recover
2014-11-30 22:41:04 -08:00
tqchen
9355f5faf2
more conservative exception watching
2014-11-30 21:39:22 -08:00
tqchen
8cef2086f5
smarter select for allreduce and bcast
2014-11-30 21:31:45 -08:00
tqchen
f7928c68a3
next round try more careful select design
2014-11-30 21:07:34 -08:00
tqchen
ecb09a23bc
add recover data, do a round of review
2014-11-30 20:59:55 -08:00
tqchen
b9b58a1275
bugfix in decide
2014-11-30 17:48:30 -08:00
tqchen
4a6c01c83c
minor change in decide
2014-11-30 17:48:02 -08:00
tqchen
27f6f8ea9e
bugfix in msg passing
2014-11-30 17:42:18 -08:00
tqchen
d8d648549f
finish message passing, do a review on msg passing and decide
2014-11-30 17:40:30 -08:00
tqchen
38cd595235
check in message passing
2014-11-30 16:38:47 -08:00
tqchen
7a60cb7f3e
checkin decide request, todo message passing
2014-11-30 16:37:26 -08:00
tqchen
68f13cd739
tight
2014-11-30 11:46:21 -08:00
tqchen
d1ce3c697c
inline
2014-11-30 11:45:50 -08:00
tqchen
2e536eda29
check in the recover strategy
2014-11-30 11:42:59 -08:00
tqchen
155ed3a814
seems an OK version of reset, start work on decide exec
2014-11-29 22:22:51 -08:00
tqchen
5b0bb53184
refactor code style, reset link still needs thought
2014-11-29 20:15:27 -08:00
tqchen
42505f473d
finish reset link log
2014-11-29 15:14:43 -08:00
tqchen
98756c068a
livelock in oob send recv
2014-11-28 21:58:15 -08:00
tqchen
aa54a038f2
livelock in oob send recv
2014-11-28 21:56:58 -08:00
tqchen
a30075794b
initial version of robust engine, add discard link, need more random mock test, next milestone will be recovery
2014-11-28 15:56:12 -08:00
nachocano
a8128493c2
execute it like this: ./test.sh 4 4000 testcase0.conf ./
Now we are passing the folder where the round instances are saved.
The problem is that calling utils::Check or utils::Assert on one or two nodes shuts down all of them. Only those nodes should be shut down for this to work. There may be some other mechanism to shut down a particular node. Tianqi?
2014-11-28 01:48:26 -08:00
nachocano
faed8285cd
execute it like ./test.sh 4 4000 testcase0.conf to obtain a successful execution
updating mock. It now wraps the calls to sync and reads config from configuration file.
I believe it's better not to use the preprocessor directive, i.e. not to put any test code in the engine_tcp. I just call the mock in the test_allreduce file. It's a file purely for testing purposes, so it's fine to use the mock there.
2014-11-28 00:16:35 -08:00
nachocano
21f3f3eec4
adding const to variable to comply with google code convention...
may need to change more stuff though. Tianqi, what else do you mean? Spaces, tabs, names?
2014-11-27 17:03:31 -08:00
tqchen
2f1ba40786
change in socket, to pass out error code
2014-11-27 16:17:07 -08:00
nachocano
c565104491
adding some references to mock inside TEST preprocessor directive.
It shouldn't be an assert because that shuts down the process. Instead, it should check the value and return some sort of error so that we can recover.
The mock contains queues, indexed by the rank of the process. For each node, you can configure the behavior you expect (success or failure for now) when you call any of the methods (AllReduce, Broadcast, LoadCheckPoint and CheckPoint). If you call AllReduce several times, the outcomes pop from the queue in order, i.e., first you can retrieve a success, then a failure, and so on.
Pretty basic for now; needs more tuning.
2014-11-26 17:24:29 -08:00
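The mock described in commit c565104491 above (per-rank queues of expected outcomes, popped one per call) might look roughly like the sketch below. This is not the code from that commit: the names `MockEngine`, `Op`, `Push`, and `Next` are hypothetical, and defaulting to success when a queue is empty is an assumption the commit message does not state.

```cpp
#include <map>
#include <queue>
#include <utility>

// Hypothetical sketch; none of these names come from the repository.
enum class Op { kAllReduce, kBroadcast, kLoadCheckPoint, kCheckPoint };

class MockEngine {
 public:
  // Queue the outcome (true = success, false = failure) that the next
  // call to `op` on process `rank` should report.
  void Push(int rank, Op op, bool ok) {
    outcomes_[std::make_pair(rank, op)].push(ok);
  }

  // Pop the next configured outcome for (rank, op) in FIFO order.
  // Assumption: report success when nothing has been queued.
  bool Next(int rank, Op op) {
    auto it = outcomes_.find(std::make_pair(rank, op));
    if (it == outcomes_.end() || it->second.empty()) return true;
    const bool ok = it->second.front();
    it->second.pop();
    return ok;
  }

 private:
  // One FIFO queue of outcomes per (rank, method) pair.
  std::map<std::pair<int, Op>, std::queue<bool>> outcomes_;
};
```

A test could queue a success followed by a failure for `Op::kAllReduce` on rank 0, then check that two consecutive `Next(0, Op::kAllReduce)` calls return them in that order, returning an error to the caller on failure rather than asserting.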
nachocano
54fcff189f
dummy mock for now
2014-11-26 16:37:23 -08:00
Tianqi Chen
5ae99372d6
Update simple_dmatrix-inl.hpp
2014-11-26 09:13:49 -08:00
Tianqi Chen
be5fb800d5
Merge pull request #112 from tfgit/master
Fixed README
2014-11-25 19:29:41 -08:00
Ted Fujimoto
baf41d589d
Fixed README
2014-11-25 22:17:36 -05:00
Tianqi Chen
8d7dbc65b3
Merge pull request #111 from tfgit/master
OS X OpenMP support instructions
2014-11-25 19:12:42 -08:00
Ted Fujimoto
198489438f
Added OS X OpenMP instructions
2014-11-25 21:42:13 -05:00
Ted Fujimoto
c356a0acc2
Remove tools folder
2014-11-25 21:27:50 -05:00
tqchen
d37f38c455
initial version of allreduce
2014-11-25 16:15:56 -08:00
Tianqi Chen
5e5bdda491
Initial commit
2014-11-25 14:37:18 -08:00
Tianqi Chen
cdcfa5687a
Update socket.h
2014-11-23 22:46:57 -08:00
tqchen
f53be2884a
ok
2014-11-23 22:42:44 -08:00
Tianqi Chen
f805ecb5f3
fix a bug in node sindex set
2014-11-23 22:35:30 -08:00
tqchen
3e162ceda6
windows strange
2014-11-23 22:21:15 -08:00
tqchen
35bf2101fe
seems a prob in win
2014-11-23 22:18:28 -08:00
Tianqi Chen
fde580b08e
fix windows run
2014-11-23 22:12:55 -08:00
tqchen
77ffd0465b
ok
2014-11-23 21:36:22 -08:00
tqchen
78ca72b9c7
start work on win
2014-11-23 21:34:15 -08:00
tqchen
d2f151ef5a
bring it back alive again
2014-11-23 21:27:16 -08:00