tqchen
1b4921977f
update doc
2015-01-03 05:20:18 -08:00
tqchen
bfb9aa3d77
add native script
2014-12-30 04:37:50 -08:00
tqchen
d64d0ef1dc
cleanup submission script
2014-12-29 06:11:58 -08:00
tqchen
12399a1d42
add more mocktest
2014-12-21 17:59:12 -08:00
tqchen
e40047f9c2
new mock test
2014-12-20 18:38:54 -08:00
tqchen
925d014271
change file structure
2014-12-20 16:19:54 -08:00
tqchen
6151899ce2
add tracker print
2014-12-19 18:40:06 -08:00
tqchen
6bf282c6c2
isolate iserializable
2014-12-19 17:36:42 -08:00
tqchen
8c35cff02c
improve script
2014-12-19 04:21:16 -08:00
tqchen
9f42b78a18
improve tracker script
2014-12-19 04:20:45 -08:00
tqchen
1754fdbf4e
enable support for lambda preprocessing function, and c++11
2014-12-19 02:00:43 -08:00
tqchen
58331067f8
cleanup testcases
2014-12-18 23:50:59 -08:00
tqchen
c8faed0b54
pass local model recover test
2014-12-18 18:53:58 -08:00
tqchen
dbd05a65b5
nice fix, start check local check
2014-12-18 18:39:24 -08:00
tqchen
3f22596e3c
check in license
2014-12-09 20:57:54 -08:00
tqchen
2750679270
normal state running ok
2014-12-07 20:57:29 -08:00
nachocano
20b03e781c
to run all executables
2014-12-06 15:37:09 -08:00
nachocano
fcf2f0a03d
to stderr
2014-12-06 15:22:29 -08:00
nachocano
659b9cd517
changing number of repetitions
2014-12-06 15:14:14 -08:00
nachocano
9ed59e71f6
speed runner
2014-12-06 12:09:40 -08:00
nachocano
e0053c62e1
adding executable
2014-12-06 12:05:08 -08:00
nachocano
8f0d7d1d3e
changing to -ho not to conflict with help
2014-12-06 12:01:05 -08:00
nachocano
771891491c
Merge branch 'master' of https://github.com/tqchen/allreduce
2014-12-06 11:59:22 -08:00
nachocano
f203d13efc
speed runner
2014-12-06 11:59:16 -08:00
tqchen
4a7d84e861
chg string bcast
2014-12-06 11:25:08 -08:00
tqchen
1519f74f3c
ok
2014-12-06 11:20:52 -08:00
tqchen
0e012cb05e
add speed test
2014-12-06 11:05:24 -08:00
tqchen
19631ecef6
more tracker renaming
2014-12-06 09:24:12 -08:00
nachocano
bb7d6814a7
creating initial version of hadoop submit script. Not working.
Not sure how to get the master uri and port. I believe I cannot do it before I launch the job.
Updating the name from submit_job to submit_job_mpi
2014-12-05 03:27:02 -08:00
tqchen
90b9f1a98a
add keepalive script
2014-12-03 15:04:30 -08:00
tqchen
7a983a4079
add keepalive
2014-12-03 13:21:30 -08:00
tqchen
8a6768763d
bug fixed ver
2014-12-03 11:51:39 -08:00
tqchen
ed1de6df80
change AllReduce to Allreduce
2014-12-02 21:11:48 -08:00
tqchen
0a3300d773
rabit run on MPI
2014-12-02 11:20:19 -08:00
tqchen
dcea64c838
check in model recover
2014-12-01 21:41:37 -08:00
tqchen
255218a2f3
change in interface, seems resetlink is still bad
2014-12-01 21:39:51 -08:00
tqchen
b76cd5858c
seems ok version
2014-12-01 20:18:25 -08:00
tqchen
46b5d46111
fix one bug, another comes
2014-12-01 19:53:41 -08:00
tqchen
993ff8bb91
find one bug, continue to next one
2014-12-01 19:34:27 -08:00
tqchen
337840d29b
recover not yet working
2014-12-01 16:57:26 -08:00
tqchen
eb2ca06d67
fresh name fresh start
2014-12-01 09:17:05 -08:00
tqchen
8cef2086f5
smarter select for allreduce and bcast
2014-11-30 21:31:45 -08:00
tqchen
5b0bb53184
refactor code style, reset link still need thoughts
2014-11-29 20:15:27 -08:00
tqchen
42505f473d
finish reset link log
2014-11-29 15:14:43 -08:00
tqchen
a30075794b
initial version of robust engine, add discard link, need more random mock test, next milestone will be recovery
2014-11-28 15:56:12 -08:00
nachocano
a8128493c2
execute it like this: ./test.sh 4 4000 testcase0.conf ./
Now we are passing the folder where the round instances are saved.
The problem is that calling utils::Check or utils::Assert on 1 or 2 nodes shuts down all of them. Only those nodes should be shut down for this to work. There may be some other mechanism to shut down a particular node. Tianqi?
2014-11-28 01:48:26 -08:00
nachocano
faed8285cd
execute it like ./test.sh 4 4000 testcase0.conf to obtain a successful execution
updating mock. It now wraps the calls to sync and reads config from configuration file.
I believe it's better not to use the preprocessor directive, i.e. not to put any test code in the engine_tcp. I just call the mock in the test_allreduce file. It's a file purely for testing purposes, so it's fine to use the mock there.
2014-11-28 00:16:35 -08:00
nachocano
21f3f3eec4
adding const to variable to comply with google code convention...
may need to change more stuff though. Taint what else do you mean? Spaces, tabs, names?
2014-11-27 17:03:31 -08:00
nachocano
c565104491
adding some references to mock inside TEST preprocessor directive.
It shouldn't be an assert because that shuts down the process. Instead, it should check the value and return some sort of error, so that we can recover.
The mock contains queues, indexed by the rank of the process. For each node, you can configure the behavior you expect (success or failure for now) when you call any of the methods (AllReduce, Broadcast, LoadCheckPoint and CheckPoint). If you call AllReduce several times, the outcomes pop from the queue in order, i.e., first you can retrieve a success, then a failure, and so on.
Pretty basic for now, need to tune it better
2014-11-26 17:24:29 -08:00
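The mock described in commit c565104491 (per-rank queues of configured outcomes, popped on each call) could be sketched roughly as follows. This is an illustration only, not the repository's code: `MockEngine`, `Op`, `Expect` and `Next` are hypothetical names, and defaulting to success when nothing is queued is an assumption the commit message does not state.

```cpp
#include <map>
#include <queue>
#include <utility>

// Hypothetical names throughout; this is not the repository's actual mock.
enum class Op { kAllReduce, kBroadcast, kLoadCheckPoint, kCheckPoint };

class MockEngine {
 public:
  // Queue an outcome (true = success, false = failure) for a future
  // call of `op` made by the process with the given rank.
  void Expect(int rank, Op op, bool ok) {
    outcomes_[std::make_pair(rank, op)].push(ok);
  }

  // Pop the next configured outcome for (rank, op) in FIFO order.
  // Assumption: report success when no outcome has been queued.
  bool Next(int rank, Op op) {
    auto it = outcomes_.find(std::make_pair(rank, op));
    if (it == outcomes_.end() || it->second.empty()) return true;
    bool ok = it->second.front();
    it->second.pop();
    return ok;
  }

 private:
  // One FIFO queue of outcomes per (rank, operation) pair.
  std::map<std::pair<int, Op>, std::queue<bool>> outcomes_;
};
```

With this sketch, queuing a success and then a failure for `Op::kAllReduce` on rank 0 makes the first `Next(0, Op::kAllReduce)` return true and the second return false, which mirrors the "first a success, then a failure" behavior the commit message describes.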
nachocano
54fcff189f
dummy mock for now
2014-11-26 16:37:23 -08:00