14 Commits

Author SHA1 Message Date
Chen Qin
9a7ac85d7e remove is_bootstrap parameter (#102)
* apply openmp simd

* clean __buildin detection, moving windows build check from xgboost project, add openmp support for vectorize reduce

* apply openmp only to rabit

* orgnize rabit signature

* remove is_bootstrap, use load_checkpoint as implict flag

* visual studio don't support latest openmp

* orgnize omp declarations

* replace memory copy with vector cast

* Revert "replace memory copy with vector cast"

This reverts commit 28de4792dcdff40d83d458510d23b7ef0b191d79.

* Revert "orgnize omp declarations"

This reverts commit 31341233d31ce93ccf34d700262b1f3f6690bbfe.

* remove openmp settings, merge into a upcoming pr

* mis

* per feedback, update comments
2019-09-10 11:45:50 -07:00
Chen Qin
5797dcb64e support bootstrap allreduce/broadcast (#98)
* support run rabit tests as xgboost subproject using xgboost/dmlc-core

* support tracker config set/get

* remove redudant printf

* remove redudant printf

* add c++0x declaration

* log allreduce/broadcast caller, engine should track caller stack for
investigation

* tracker support binary config format

* Revert "tracker support binary config format"

This reverts commit 2a28e5e2b55c200cb621af8d19f17ab1bc62503b.

* remove caller, prototype fetch allreduce/broadcast results from resbuf

* store cached allreduce/broadcast seq_no to tracker

* allow restore all caches from other nodes

* try new rabit collective cache, todo: recv_link seems down

* link up cache restore with main recovery

* cleanup load cache state

* update cache api

* pass test.mk

* have a working tests

* try to unify check into actionsummary

* more logging to debug distributed hist three method issue

* update rabit interface to support caller signature matching

* splite seq_counter from cur_cache_seq to different variables

* still see issue with inf loop

* support debug print caller as well as allreduce op

* cleanup

* remove get/set cache from model_recover, adding recover in
loadcheckpoint

* clarify rabit cache strategy, cache is set only by successful collective
call involving all nodes with unique cache key. if all nodes call
getcache at same time, we keep rabit run collective call. If some nodes
call getcache while others not, we backfill cache from those nodes with
most entries

* revert caller logs

* fix lint error

* fix engine mpi signature

* support getcache by ref

* allow result buffer presiet to filestream

* add loging

* try fix checkpoint failure recovery case

* use int64_t to avoid overflow caused seq fault

* try avoid int overflow

* try fix checkpoint failure recovery case

* try avoid seqno overflow to negative by offseting specifial flag value
adding cache seq no to checkpoint/load checkpoint/check point ack to avoid
confusion from cache recovery

* fix cache seq assert error

* remove loging, handle edge case

* add extensive log to checkpoint state  with different seq no

* fix lint errors

* clean up comments before merge back to master

* add logs to allreduce/broadcast/checkpoint

* use unsinged int 32 and give seq no larger range

* address remove allreduce dropseq code segment

* using caller signature to filter bootstrapallreduces

* remove get/set cache from empty

* apply signature to reducer

* apply signature to broadcast

* add key to broadcat log

* fix broadcast signature

* fix default _line value for non linux system

* adding comments, remove sleep(1)

* fix osx build issue

* try fix mpi

* fix doc

* fix engine_empty api

* logging, adding more logs, restore immutable assertion

* print unsinged int with ud

* fix lint

* rename seqtype to kSeq and KCache indicating it's usage
apply kDiffSeq check to load_cache routine

* comment allreduce/broadcast log

* allow tests run on arm

* enable flag to turn on / off cache

* add log info alert if user choose to enable rabit bootstrap cache

* add rabit_debug setting so user can use config to turn on

* log flags when user turn on rabit_debug

* force rabit restart if tracker assign -1 rank

* use OPENMP to vecotrize reducer

* address comment

* Revert "address comment"

This reverts commit 1dc61f33e7357dad8fa65528abeb81db92c5f9ed.

* fix checkpoint size print 0

* per feedback, remove DISABLEOPEMP, address race condition

* - remove openmp from this pr
- update name from cache to boostrapcache

* add default value of signature macros

* remove openmp from cmake file

* Update src/allreduce_robust.cc

Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

* Update src/allreduce_robust.cc

Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

* run test with cmake

* remove openmp

* fix cmake based tests

* use cmake test fix darwin .dylib issue

* move around rabit_signature definition due to windows build

* misc, add c++ check in CMakeFile

* per feedback

* resolve CMake file

* update rabit version
2019-08-27 18:12:33 -07:00
Nan Zhu
65b718a5e7
return values in Init and Finalize (#96)
* make inti function return values

* address the comments
2019-06-25 20:05:54 -07:00
Nan Zhu
fc85f776f4
allow not stop process in error (#97)
* allow not stop process in error

* fix merge error
2019-06-25 13:04:39 -07:00
tqchen
f0f07ecd22 fix 2016-02-27 20:51:00 -08:00
tqchen
50a66b3855 fix empty engine 2015-04-09 08:44:33 -07:00
tqchen
d558f6f550 redefine distributed means 2015-03-09 14:43:05 -07:00
tqchen
4db0a62a06 bugfix of lazy prepare 2015-02-11 20:31:46 -08:00
tqchen
1db6449b01 remove include in -I, make things easier to direct compile 2015-01-18 21:30:19 -08:00
tqchen
87c7817124 add lazy check, need test, find a race condition 2015-01-14 11:58:43 -08:00
tqchen
27d6977a3e cpplint pass 2014-12-28 05:12:07 -08:00
tqchen
925d014271 change file structure 2014-12-20 16:19:54 -08:00
tqchen
e72a869fd1 add complex reducer in 2014-12-19 20:57:53 -08:00
tqchen
6151899ce2 add tracker print 2014-12-19 18:40:06 -08:00