61 Commits

Author SHA1 Message Date
Jiaming Yuan
4acdd7c6f6
Remove stop process. (#143) 2020-08-05 10:12:00 -07:00
Philip Hyunsu Cho
74bf00a5ab
De-duplicate macro _CRT_SECURE_NO_WARNINGS / _CRT_SECURE_NO_DEPRECATE (#136)
* De-duplicate macro _CRT_SECURE_NO_WARNINGS / _CRT_SECURE_NO_DEPRECATE

* Move all macros to base.h

* Fix CI
2020-06-28 09:51:50 -07:00
Philip Hyunsu Cho
2f7fcff4d7
Fix build on FreeBSD (#133) 2020-01-27 12:15:32 -08:00
Nan Zhu
6e563951af
fix hanging trainings (#132)
* fix hanging connections

* remove logging
2020-01-27 09:12:02 -08:00
Chen Qin
493ad834a1 allow duplicated bootstrap allreduce overwrite previous results (#128)
* allow timeout to 0 to enable immediate exit

* disable duplicated signature check, overwrite results with same key
2019-11-13 10:19:58 +08:00
nateagr
1907b25cd0 Expose RabitAllGatherRing and RabitGetRingPrevRank (#113)
* add unittests

* Expose RabitAllGatherRing and RabitGetRingPrevRank

* Enabled TCP_NODELAY to decrease latency
2019-11-12 19:55:32 +08:00
Chen Qin
5d1b613910 exit when allreduce/broadcast error cause timeout (#112)
* keep async timeout task

* add missing pthread to cmake

* add tests

* Add a sleep period to avoid flushing the tracker.
2019-10-11 03:39:39 -04:00
Chen Qin
af7281afe3 unittests mock, cleanup (#111)
* cleanup, fix issue introduced after removing is_bootstrap parameter

* misc

* clean

* add unittests
2019-10-01 13:36:11 -07:00
Chen Qin
ddcc2d85da Clean up cmake script and code includes (#106)
* Clean up CMake scripts and related include paths.
* Add unittests.
2019-09-26 02:29:04 -04:00
Xu Xiao
e92641887b remove unreached code of AllreduceRobust::CheckAndRecover (#108) 2019-09-18 23:06:59 -04:00
Chen Qin
9a7ac85d7e remove is_bootstrap parameter (#102)
* apply openmp simd

* clean up __builtin detection, move windows build check from xgboost project, add openmp support for vectorized reduce

* apply openmp only to rabit

* organize rabit signature

* remove is_bootstrap, use load_checkpoint as implicit flag

* visual studio doesn't support latest openmp

* organize omp declarations

* replace memory copy with vector cast

* Revert "replace memory copy with vector cast"

This reverts commit 28de4792dcdff40d83d458510d23b7ef0b191d79.

* Revert "organize omp declarations"

This reverts commit 31341233d31ce93ccf34d700262b1f3f6690bbfe.

* remove openmp settings, merge into an upcoming pr

* misc

* per feedback, update comments
2019-09-10 11:45:50 -07:00
Chen Qin
5797dcb64e support bootstrap allreduce/broadcast (#98)
* support run rabit tests as xgboost subproject using xgboost/dmlc-core

* support tracker config set/get

* remove redundant printf

* remove redundant printf

* add c++0x declaration

* log allreduce/broadcast caller, engine should track caller stack for
investigation

* tracker support binary config format

* Revert "tracker support binary config format"

This reverts commit 2a28e5e2b55c200cb621af8d19f17ab1bc62503b.

* remove caller, prototype fetch allreduce/broadcast results from resbuf

* store cached allreduce/broadcast seq_no to tracker

* allow restore all caches from other nodes

* try new rabit collective cache, todo: recv_link seems down

* link up cache restore with main recovery

* cleanup load cache state

* update cache api

* pass test.mk

* have working tests

* try to unify check into actionsummary

* more logging to debug distributed hist tree method issue

* update rabit interface to support caller signature matching

* split seq_counter from cur_cache_seq into different variables

* still see issue with inf loop

* support debug print caller as well as allreduce op

* cleanup

* remove get/set cache from model_recover, adding recover in
loadcheckpoint

* clarify rabit cache strategy: the cache is set only by a successful
collective call involving all nodes with a unique cache key. If all nodes
call getcache at the same time, we let rabit run the collective call. If
some nodes call getcache while others do not, we backfill the cache from
the nodes with the most entries

* revert caller logs

* fix lint error

* fix engine mpi signature

* support getcache by ref

* allow result buffer to persist to filestream

* add logging

* try fix checkpoint failure recovery case

* use int64_t to avoid overflow caused seq fault

* try avoid int overflow

* try fix checkpoint failure recovery case

* try avoid seqno overflow to negative by offsetting special flag value
adding cache seq no to checkpoint/load checkpoint/check point ack to avoid
confusion from cache recovery

* fix cache seq assert error

* remove logging, handle edge case

* add extensive log to checkpoint state with different seq no

* fix lint errors

* clean up comments before merge back to master

* add logs to allreduce/broadcast/checkpoint

* use unsigned int 32 and give seq no larger range

* address remove allreduce dropseq code segment

* using caller signature to filter bootstrapallreduces

* remove get/set cache from empty

* apply signature to reducer

* apply signature to broadcast

* add key to broadcast log

* fix broadcast signature

* fix default _line value for non linux system

* adding comments, remove sleep(1)

* fix osx build issue

* try fix mpi

* fix doc

* fix engine_empty api

* logging, adding more logs, restore immutable assertion

* print unsigned int with ud

* fix lint

* rename seqtype to kSeq and KCache indicating its usage
apply kDiffSeq check to load_cache routine

* comment allreduce/broadcast log

* allow tests run on arm

* enable flag to turn on / off cache

* add log info alert if user choose to enable rabit bootstrap cache

* add rabit_debug setting so user can use config to turn on

* log flags when user turn on rabit_debug

* force rabit restart if tracker assign -1 rank

* use OPENMP to vectorize reducer

* address comment

* Revert "address comment"

This reverts commit 1dc61f33e7357dad8fa65528abeb81db92c5f9ed.

* fix checkpoint size print 0

* per feedback, remove DISABLEOPEMP, address race condition

* - remove openmp from this pr
- update name from cache to bootstrapcache

* add default value of signature macros

* remove openmp from cmake file

* Update src/allreduce_robust.cc

Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

* Update src/allreduce_robust.cc

Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

* run test with cmake

* remove openmp

* fix cmake based tests

* use cmake test fix darwin .dylib issue

* move around rabit_signature definition due to windows build

* misc, add c++ check in CMakeFile

* per feedback

* resolve CMake file

* update rabit version
2019-08-27 18:12:33 -07:00
Nan Zhu
65b718a5e7
return values in Init and Finalize (#96)
* make init function return values

* address the comments
2019-06-25 20:05:54 -07:00
Chen Qin
e3d51d3e62 [rabit harden] Enable all tests (#90)
* include osx in tests
* address `time_wait` on port assignment
* increase submit attempts.
* cleanup tests
2019-04-24 19:12:11 +08:00
Chen Qin
ecd4bf7aae [rabit harden] replace hardcopy dmlc-core headers with submodule links (#86)
* backport dmlc header changes to rabit

* use gitmodule to reference latest dmlc header files

* include ref to dmlc-core
fix cmake

* update cmake file, add cmake build travis task

* try force using g++-4.8

* per feedback, update cmake
2019-03-23 13:11:29 +08:00
Chen Qin
ed06e0c6af [rabit harden] fix rabit tests (#81)
* enable model recovery tests
* force use gcc4.8 in Travis
2019-03-15 07:16:45 +08:00
Jiaming Yuan
1cc34f01db
Fix ssize_t definition. (#80)
* Fix linter.
2019-02-18 19:25:08 +08:00
Jiaming Yuan
05941a5f96
Try fixing mingw build error when using CMake. (#77)
* Try fixing mingw build error when using CMake.

* Check __MINGW32__ .

* Fix linter.
2019-02-16 22:35:43 +08:00
Chen Qin
eb2590b774 workaround macosx java test race condition (#74)
* fix error in dmlc#57, clean up comments and naming

* include missing packages, disable recovery tests for now

* disable local_recover tests until we have a bug fix

* support larger cluster

* fix lint, merge with master

* fix mac osx test failure in https://github.com/dmlc/xgboost/pull/3818

* Update allreduce_robust.cc
2018-10-26 12:39:31 -07:00
Chen Qin
3a35dabfae support larger cluster (#73)
* fix error in dmlc#57, clean up comments and naming

* include missing packages, disable recovery tests for now

* disable local_recover tests until we have a bug fix

* support larger cluster

* fix lint, merge with master
2018-10-22 10:13:45 -07:00
AbdealiJK
21b5e12913 allreduce_robust.cc: Allow num_global_replica to be 0 (#38)
In some cases, users may not want to have any global replica of
the data being broadcasted/all-reduced. In such cases, set the
result_buffer_round to -1 as a flag that this is not necessary
and check for it.
2016-11-23 19:34:11 -08:00
tqchen
e19fced5cb [FIX] rabit on single node 2016-05-10 20:05:59 -07:00
tqchen
be50e7b632 Make rabit library thread local 2016-03-01 20:12:51 -08:00
tqchen
7479791f6a refactor: librabit 2016-02-27 10:14:41 -08:00
tqchen
112d866dc9 [RABIT] fix rabit in local mode 2016-01-12 21:34:26 -08:00
tqchen
3cc49ad0e8 lint and travis 2015-07-03 15:15:11 -07:00
Tianqi Chen
fd8920c71d fix win32 2015-05-28 12:24:26 -07:00
tqchen
e95c96232a remove I prefix from interface, serializable now takes in pointer 2015-04-08 15:25:58 -07:00
tqchen
146e069000 bugfix: logical boundary for ring buffer 2015-03-11 20:28:34 -07:00
tqchen
67ebf81e7a allow setup from env variables 2015-03-07 16:45:31 -08:00
tqchen
4db0a62a06 bugfix of lazy prepare 2015-02-11 20:31:46 -08:00
tqchen
1db6449b01 remove include in -I, make things easier to direct compile 2015-01-18 21:30:19 -08:00
Tianqi Chen
56a80f431b check in windows solutions, pass small test in windows 2015-01-16 20:56:34 -08:00
tqchen
a7faac2f09 ok 2015-01-14 21:59:45 -08:00
tqchen
f161d2f1e5 fix bug in initialization of routing 2015-01-14 19:40:41 -08:00
tqchen
797fe27efe struct return type version 2015-01-14 15:43:28 -08:00
tqchen
a57c5c5425 add more error report when things go wrong, need review 2015-01-14 15:32:36 -08:00
tqchen
87c7817124 add lazy check, need test, find a race condition 2015-01-14 11:58:43 -08:00
tqchen
348a1e7619 change default behavior to behave normal 2015-01-13 22:21:15 -08:00
tqchen
27d6977a3e cpplint pass 2014-12-28 05:12:07 -08:00
tqchen
a624051b85 add keepalive to socket, fix recover problem when a node is requester and pass data 2014-12-21 17:55:08 -08:00
tqchen
925d014271 change file structure 2014-12-20 16:19:54 -08:00
tqchen
2c0a0671ad skip actions when there is only 1 node 2014-12-19 19:21:21 -08:00
tqchen
6bf282c6c2 isolate iserializable 2014-12-19 17:36:42 -08:00
tqchen
1754fdbf4e enable support for lambda preprocessing function, and c++11 2014-12-19 02:00:43 -08:00
tqchen
58331067f8 cleanup testcases 2014-12-18 23:50:59 -08:00
tqchen
aa2cb38543 ResetLink still not ok 2014-12-18 21:45:38 -08:00
tqchen
c8faed0b54 pass local model recover test 2014-12-18 18:53:58 -08:00
tqchen
dbd05a65b5 nice fix, start check local check 2014-12-18 18:39:24 -08:00
tqchen
3f22596e3c check in license 2014-12-09 20:57:54 -08:00