32 Commits

Author SHA1 Message Date
Chen Qin
5797dcb64e support bootstrap allreduce/broadcast (#98)
* support run rabit tests as xgboost subproject using xgboost/dmlc-core

* support tracker config set/get

* remove redundant printf

* remove redundant printf

* add c++0x declaration

* log allreduce/broadcast caller, engine should track caller stack for
investigation

* tracker support binary config format

* Revert "tracker support binary config format"

This reverts commit 2a28e5e2b55c200cb621af8d19f17ab1bc62503b.

* remove caller, prototype fetch allreduce/broadcast results from resbuf

* store cached allreduce/broadcast seq_no to tracker

* allow restore all caches from other nodes

* try new rabit collective cache, todo: recv_link seems down

* link up cache restore with main recovery

* cleanup load cache state

* update cache api

* pass test.mk

* have working tests

* try to unify check into actionsummary

* more logging to debug distributed hist three method issue

* update rabit interface to support caller signature matching

* split seq_counter from cur_cache_seq to different variables

* still see issue with inf loop

* support debug print caller as well as allreduce op

* cleanup

* remove get/set cache from model_recover, adding recover in
loadcheckpoint

* clarify rabit cache strategy: cache is set only by a successful collective
call involving all nodes with a unique cache key. If all nodes call
getcache at the same time, rabit still runs the collective call. If some nodes
call getcache while others do not, we backfill the cache from the nodes with
the most entries

* revert caller logs

* fix lint error

* fix engine mpi signature

* support getcache by ref

* allow result buffer to persist to filestream

* add logging

* try fix checkpoint failure recovery case

* use int64_t to avoid overflow caused seq fault

* try avoid int overflow

* try fix checkpoint failure recovery case

* try avoid seqno overflow to negative by offsetting special flag value
adding cache seq no to checkpoint/load checkpoint/check point ack to avoid
confusion from cache recovery

* fix cache seq assert error

* remove logging, handle edge case

* add extensive log to checkpoint state  with different seq no

* fix lint errors

* clean up comments before merge back to master

* add logs to allreduce/broadcast/checkpoint

* use unsigned int 32 and give seq no larger range

* address remove allreduce dropseq code segment

* using caller signature to filter bootstrapallreduces

* remove get/set cache from empty

* apply signature to reducer

* apply signature to broadcast

* add key to broadcast log

* fix broadcast signature

* fix default _line value for non linux system

* adding comments, remove sleep(1)

* fix osx build issue

* try fix mpi

* fix doc

* fix engine_empty api

* logging, adding more logs, restore immutable assertion

* print unsigned int with ud

* fix lint

* rename seqtype to kSeq and KCache indicating its usage
apply kDiffSeq check to load_cache routine

* comment allreduce/broadcast log

* allow tests run on arm

* enable flag to turn on / off cache

* add log info alert if user choose to enable rabit bootstrap cache

* add rabit_debug setting so user can use config to turn on

* log flags when user turn on rabit_debug

* force rabit restart if tracker assign -1 rank

* use OPENMP to vectorize reducer

* address comment

* Revert "address comment"

This reverts commit 1dc61f33e7357dad8fa65528abeb81db92c5f9ed.

* fix checkpoint size print 0

* per feedback, remove DISABLEOPEMP, address race condition

* - remove openmp from this pr
- update name from cache to bootstrapcache

* add default value of signature macros

* remove openmp from cmake file

* Update src/allreduce_robust.cc

Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

* Update src/allreduce_robust.cc

Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

* run test with cmake

* remove openmp

* fix cmake based tests

* use cmake test fix darwin .dylib issue

* move around rabit_signature definition due to windows build

* misc, add c++ check in CMakeFile

* per feedback

* resolve CMake file

* update rabit version
2019-08-27 18:12:33 -07:00
Chen Qin
5c3b36f346 Allow using external dmlc-core (#91)
* Set `RABIT_BUILD_DMLC=1` if use dmlc-core in rabit

* remove dmlc-core
2019-04-26 15:28:45 +08:00
Chen Qin
e3d51d3e62 [rabit harden] Enable all tests (#90)
* include osx in tests
* address `time_wait` on port assignment
* increase submit attempts.
* cleanup tests
2019-04-24 19:12:11 +08:00
Chen Qin
ecd4bf7aae [rabit harden] replace hardcopy dmlc-core headers with submodule links (#86)
* backport dmlc header changes to rabit

* use gitmodule to reference latest dmlc header files

* include ref to dmlc-core
fix cmake

* update cmake file, add cmake build travis task

* try force using g++-4.8

* per feedback, update cmake
2019-03-23 13:11:29 +08:00
Chen Qin
ed06e0c6af [rabit harden] fix rabit tests (#81)
* enable model recovery tests
* force use gcc4.8 in Travis
2019-03-15 07:16:45 +08:00
Jiaming Yuan
0101a4719c Remove dmlc logging. (#78)
* Remove dmlc logging header.

* Fix lint.
2019-02-16 18:37:54 -08:00
snehlatamohite
2eb1a1a371 Use -msse2 flag depending upon architecture while compiling the rabit code (#49) 2017-09-01 08:42:45 -07:00
Qiang Kou (KK)
41c96a25a9 To compile on ARM CPU (#46) 2017-07-12 20:24:19 -07:00
kabu4i
af1b7d6e7a Applied FreeBSD support (#37) 2016-11-15 21:10:51 -08:00
tqchen
7479791f6a refactor: librabit 2016-02-27 10:14:41 -08:00
tqchen
7b59dcb8b8 minor 2015-08-21 07:59:06 -07:00
tqchen
270a49ee75 add requirements 2015-07-23 22:22:52 -07:00
tqchen
3cc49ad0e8 lint and travis 2015-07-03 15:15:11 -07:00
tqchen
ceedf4ea96 fix 2015-05-28 12:37:06 -07:00
tqchen
8bbed35736 modify 2015-05-28 10:44:19 -07:00
tqchen
67ebf81e7a allow setup from env variables 2015-03-07 16:45:31 -08:00
tqchen
1bb8fe9615 chg makefile 2015-01-30 16:46:10 -08:00
tqchen
fb13cab216 change makefile 2015-01-30 16:30:45 -08:00
tqchen
85b746394e change def of reducer to take function ptr 2015-01-19 21:24:52 -08:00
tqchen
fe6366eb40 add engine base 2015-01-19 19:11:15 -08:00
tqchen
1db6449b01 remove include in -I, make things easier to direct compile 2015-01-18 21:30:19 -08:00
tqchen
f161d2f1e5 fix bug in initialization of routing 2015-01-14 19:40:41 -08:00
tqchen
348a1e7619 change default behavior to behave normal 2015-01-13 22:21:15 -08:00
tqchen
3419cf9aa7 add auto caching of python in hadoop script, mock test module to python, with checkpt 2015-01-13 14:29:10 -08:00
tqchen
2d72c853df checkin broadcast python module 2015-01-12 22:32:13 -08:00
tqchen
9a4a81f100 add wrapper 2015-01-12 21:33:01 -08:00
tqchen
1b4921977f update doc 2015-01-03 05:20:18 -08:00
tqchen
27d6977a3e cpplint pass 2014-12-28 05:12:07 -08:00
tqchen
10bb407a2c add mock engine 2014-12-20 18:31:33 -08:00
tqchen
925d014271 change file structure 2014-12-20 16:19:54 -08:00
tqchen
6151899ce2 add tracker print 2014-12-19 18:40:06 -08:00
tqchen
9abe6ad4d8 checkin makefile 2014-12-03 21:30:11 -08:00