81 Commits

Author SHA1 Message Date
Jiaming Yuan
4acdd7c6f6
Remove stop process. (#143) 2020-08-05 10:12:00 -07:00
Philip Hyunsu Cho
74bf00a5ab
De-duplicate macro _CRT_SECURE_NO_WARNINGS / _CRT_SECURE_NO_DEPRECATE (#136)
* De-duplicate macro _CRT_SECURE_NO_WARNINGS / _CRT_SECURE_NO_DEPRECATE

* Move all macros to base.h

* Fix CI
2020-06-28 09:51:50 -07:00
nateagr
1907b25cd0 Expose RabitAllGatherRing and RabitGetRingPrevRank (#113)
* add unittests

* Expose RabitAllGatherRing and RabitGetRingPrevRank

* Enabled TCP_NODELAY to decrease latency
2019-11-12 19:55:32 +08:00
Jiaming Yuan
6dab74689c
Add SeekEnd to MemoryFixSizeBuffer. (#109)
* Don't assert buffer size.
2019-10-13 00:09:25 -04:00
Chen Qin
5d1b613910 exit when allreduce/broadcast error cause timeout (#112)
* keep async timeout task

* add missing pthread to cmake

* add tests

* Add a sleep period to avoid flushing the tracker.
2019-10-11 03:39:39 -04:00
Chen Qin
af7281afe3 unittests mock, cleanup (#111)
* cleanup, fix issue involved after remove is_bootstrap parameter

* misc

* clean

* add unittests
2019-10-01 13:36:11 -07:00
Chen Qin
ddcc2d85da Clean up cmake script and code includes (#106)
* Clean up CMake scripts and related include paths.
* Add unittests.
2019-09-26 02:29:04 -04:00
Chen Qin
9a7ac85d7e remove is_bootstrap parameter (#102)
* apply openmp simd

* clean __buildin detection, moving windows build check from xgboost project, add openmp support for vectorize reduce

* apply openmp only to rabit

* orgnize rabit signature

* remove is_bootstrap, use load_checkpoint as implict flag

* visual studio don't support latest openmp

* orgnize omp declarations

* replace memory copy with vector cast

* Revert "replace memory copy with vector cast"

This reverts commit 28de4792dcdff40d83d458510d23b7ef0b191d79.

* Revert "orgnize omp declarations"

This reverts commit 31341233d31ce93ccf34d700262b1f3f6690bbfe.

* remove openmp settings, merge into a upcoming pr

* mis

* per feedback, update comments
2019-09-10 11:45:50 -07:00
Chen Qin
5797dcb64e support bootstrap allreduce/broadcast (#98)
* support run rabit tests as xgboost subproject using xgboost/dmlc-core

* support tracker config set/get

* remove redudant printf

* remove redudant printf

* add c++0x declaration

* log allreduce/broadcast caller, engine should track caller stack for
investigation

* tracker support binary config format

* Revert "tracker support binary config format"

This reverts commit 2a28e5e2b55c200cb621af8d19f17ab1bc62503b.

* remove caller, prototype fetch allreduce/broadcast results from resbuf

* store cached allreduce/broadcast seq_no to tracker

* allow restore all caches from other nodes

* try new rabit collective cache, todo: recv_link seems down

* link up cache restore with main recovery

* cleanup load cache state

* update cache api

* pass test.mk

* have a working tests

* try to unify check into actionsummary

* more logging to debug distributed hist three method issue

* update rabit interface to support caller signature matching

* splite seq_counter from cur_cache_seq to different variables

* still see issue with inf loop

* support debug print caller as well as allreduce op

* cleanup

* remove get/set cache from model_recover, adding recover in
loadcheckpoint

* clarify rabit cache strategy, cache is set only by successful collective
call involving all nodes with unique cache key. if all nodes call
getcache at same time, we keep rabit run collective call. If some nodes
call getcache while others not, we backfill cache from those nodes with
most entries

* revert caller logs

* fix lint error

* fix engine mpi signature

* support getcache by ref

* allow result buffer presiet to filestream

* add loging

* try fix checkpoint failure recovery case

* use int64_t to avoid overflow caused seq fault

* try avoid int overflow

* try fix checkpoint failure recovery case

* try avoid seqno overflow to negative by offseting specifial flag value
adding cache seq no to checkpoint/load checkpoint/check point ack to avoid
confusion from cache recovery

* fix cache seq assert error

* remove loging, handle edge case

* add extensive log to checkpoint state  with different seq no

* fix lint errors

* clean up comments before merge back to master

* add logs to allreduce/broadcast/checkpoint

* use unsinged int 32 and give seq no larger range

* address remove allreduce dropseq code segment

* using caller signature to filter bootstrapallreduces

* remove get/set cache from empty

* apply signature to reducer

* apply signature to broadcast

* add key to broadcat log

* fix broadcast signature

* fix default _line value for non linux system

* adding comments, remove sleep(1)

* fix osx build issue

* try fix mpi

* fix doc

* fix engine_empty api

* logging, adding more logs, restore immutable assertion

* print unsinged int with ud

* fix lint

* rename seqtype to kSeq and KCache indicating it's usage
apply kDiffSeq check to load_cache routine

* comment allreduce/broadcast log

* allow tests run on arm

* enable flag to turn on / off cache

* add log info alert if user choose to enable rabit bootstrap cache

* add rabit_debug setting so user can use config to turn on

* log flags when user turn on rabit_debug

* force rabit restart if tracker assign -1 rank

* use OPENMP to vecotrize reducer

* address comment

* Revert "address comment"

This reverts commit 1dc61f33e7357dad8fa65528abeb81db92c5f9ed.

* fix checkpoint size print 0

* per feedback, remove DISABLEOPEMP, address race condition

* - remove openmp from this pr
- update name from cache to boostrapcache

* add default value of signature macros

* remove openmp from cmake file

* Update src/allreduce_robust.cc

Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

* Update src/allreduce_robust.cc

Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

* run test with cmake

* remove openmp

* fix cmake based tests

* use cmake test fix darwin .dylib issue

* move around rabit_signature definition due to windows build

* misc, add c++ check in CMakeFile

* per feedback

* resolve CMake file

* update rabit version
2019-08-27 18:12:33 -07:00
Chen Qin
e3d51d3e62 [rabit harden] Enable all tests (#90)
* include osx in tests
* address `time_wait` on port assignment
* increase submit attempts.
* cleanup tests
2019-04-24 19:12:11 +08:00
Chen Qin
ecd4bf7aae [rabit harden] replace hardcopy dmlc-core headers with submodule links (#86)
* backport dmlc header changes to rabit

* use gitmodule to reference latest dmlc header files

* include ref to dmlc-core
fix cmake

* update cmake file, add cmake build traivs task

* try force using g++-4.8

* per feedback, update cmake
2019-03-23 13:11:29 +08:00
Chen Qin
785d7e54d3 [mpi] add engine_mpi travis build (#83) 2019-03-15 22:58:47 +08:00
Chen Qin
ed06e0c6af [rabit harden] fix rabit tests (#81)
* enable model recovery tests
* force use gcc4.8 in Travis
2019-03-15 07:16:45 +08:00
Chen Qin
3a35dabfae support larger cluster (#73)
* fix error in dmlc#57, clean up comments and naming

* include missing packages, disable recovery tests for now

* disable local_recover tests until we have a bug fix

* support larger cluster

* fix lint, merge with master
2018-10-22 10:13:45 -07:00
Dennis O'Brien
440e81db0b Fixed print statements and xrange to be compatibile with Python 2 and 3. (#55) 2018-02-26 12:19:04 -08:00
tqchen
e3188afbe8 fix 2016-02-28 13:09:18 -08:00
tqchen
26c87ec6e7 fix test 2016-02-28 09:35:08 -08:00
tqchen
7479791f6a refactor: librabit 2016-02-27 10:14:41 -08:00
Tianqi Chen
7addac910b Update Makefile 2015-07-03 15:23:26 -07:00
tqchen
3cc49ad0e8 lint and travis 2015-07-03 15:15:11 -07:00
tqchen
e95c96232a remove I prefix from interface, serializable now takes in pointer 2015-04-08 15:25:58 -07:00
tqchen
c57dad8b17 add ringbased passing and batch schedule 2015-03-11 12:00:19 -07:00
tqchen
6d5ac6446c chg test folder 2015-01-15 10:24:58 -08:00
tqchen
8f23eb11d7 change convention 2015-01-15 10:22:59 -08:00
tqchen
6dbaddd2b9 ok 2015-01-14 22:11:00 -08:00
tqchen
968b33ec79 set all tracker thread to deamon 2015-01-14 12:05:00 -08:00
tqchen
87c7817124 add lazy check, need test, find a race condition 2015-01-14 11:58:43 -08:00
tqchen
348a1e7619 change default behavior to behave normal 2015-01-13 22:21:15 -08:00
tqchen
532575b752 ok 2015-01-13 14:41:37 -08:00
tqchen
3419cf9aa7 add auto caching of python in hadoop script, mock test module to python, with checkpt 2015-01-13 14:29:10 -08:00
tqchen
1b4921977f update doc 2015-01-03 05:20:18 -08:00
tqchen
bfb9aa3d77 add native script 2014-12-30 04:37:50 -08:00
tqchen
d64d0ef1dc cleanup submission script 2014-12-29 06:11:58 -08:00
tqchen
12399a1d42 add more mocktest 2014-12-21 17:59:12 -08:00
tqchen
e40047f9c2 new mock test 2014-12-20 18:38:54 -08:00
tqchen
925d014271 change file structure 2014-12-20 16:19:54 -08:00
tqchen
6151899ce2 add tracker print 2014-12-19 18:40:06 -08:00
tqchen
6bf282c6c2 isolate iserializable 2014-12-19 17:36:42 -08:00
tqchen
8c35cff02c improve script 2014-12-19 04:21:16 -08:00
tqchen
9f42b78a18 improve tracker script 2014-12-19 04:20:45 -08:00
tqchen
1754fdbf4e enable support for lambda preprocessing function, and c++11 2014-12-19 02:00:43 -08:00
tqchen
58331067f8 cleanup testcases 2014-12-18 23:50:59 -08:00
tqchen
c8faed0b54 pass local model recover test 2014-12-18 18:53:58 -08:00
tqchen
dbd05a65b5 nice fix, start check local check 2014-12-18 18:39:24 -08:00
tqchen
3f22596e3c check in license 2014-12-09 20:57:54 -08:00
tqchen
2750679270 normal state running ok 2014-12-07 20:57:29 -08:00
nachocano
20b03e781c to run all executables 2014-12-06 15:37:09 -08:00
nachocano
fcf2f0a03d to stderr 2014-12-06 15:22:29 -08:00
nachocano
659b9cd517 changing number of repetitions 2014-12-06 15:14:14 -08:00
nachocano
9ed59e71f6 speed runner 2014-12-06 12:09:40 -08:00