Rabit: Reliable Allreduce and Broadcast Interface
rabit is a lightweight library that provides a fault-tolerant interface for Allreduce and Broadcast. It is designed to support easy implementations of distributed machine learning programs, many of which fall naturally under the Allreduce abstraction. The goal of rabit is to support portable, scalable and reliable distributed machine learning programs.
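For a flavor of the programming model, here is a minimal Allreduce sketch in C++. It is illustrative only: it assumes the calls declared in the interface header (Init, Allreduce, GetRank, GetWorldSize, TrackerPrintf, Finalize); consult the header for the authoritative signatures.

```cpp
#include <rabit/rabit.h>
#include <vector>

int main(int argc, char *argv[]) {
  rabit::Init(argc, argv);                // connect to the tracker and peers
  std::vector<float> data(32, 1.0f);      // each worker holds a local buffer
  // Element-wise sum across all workers; every worker gets the same result back.
  rabit::Allreduce<rabit::op::Sum>(&data[0], data.size());
  if (rabit::GetRank() == 0) {
    rabit::TrackerPrintf("allreduce finished on %d workers\n", rabit::GetWorldSize());
  }
  rabit::Finalize();                      // shut down cleanly
  return 0;
}
```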
- Tutorial
- API Documentation
- You can also directly read the interface header
- XGBoost
- Rabit is one of the backbone libraries supporting distributed XGBoost
Features
All these features come from facts about the small rabbit :)
- Portable: rabit is lightweight and runs everywhere
- Rabit is a library instead of a framework; a program only needs to link against the library to run
- Rabit only relies on a mechanism to start programs, which is provided by most frameworks
- You can run rabit programs on many platforms, including Yarn (Hadoop) and MPI, using the same code
- Scalable and Flexible: rabit runs fast
- Rabit programs use Allreduce to communicate and do not pay the per-iteration cost of the MapReduce abstraction.
- Programs can call rabit functions in any order, as opposed to frameworks where callbacks are offered and called by the framework, i.e. the inversion-of-control principle.
- Programs persist across all iterations, unless they fail and recover.
- Reliable: rabit digs burrows to avoid disasters
- Rabit programs can recover the model and results using synchronous function calls (see the checkpoint sketch below).
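Fault tolerance revolves around checkpointing: a worker periodically calls CheckPoint, and on restart LoadCheckPoint tells it which iteration to resume from. The sketch below follows that pattern; MyModel and num_iter are placeholders, and the serialization helpers are assumptions about a user-defined model type rather than part of rabit itself.

```cpp
#include <rabit/rabit.h>
#include <vector>

// A user-defined model; rabit only requires it to be Serializable.
class MyModel : public rabit::Serializable {
 public:
  std::vector<float> weight;
  void Load(rabit::Stream *fi) override { fi->Read(&weight); }
  void Save(rabit::Stream *fo) const override { fo->Write(weight); }
};

int main(int argc, char *argv[]) {
  const int num_iter = 100;               // placeholder iteration count
  rabit::Init(argc, argv);
  MyModel model;
  // Version of the last successful checkpoint (0 on a fresh start); a worker
  // restarted after a failure resumes from here instead of from iteration 0.
  int start_iter = rabit::LoadCheckPoint(&model);
  for (int iter = start_iter; iter < num_iter; ++iter) {
    // ... update `model` using Allreduce/Broadcast calls ...
    rabit::CheckPoint(&model);            // mark this version as recoverable
  }
  rabit::Finalize();
  return 0;
}
```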
Use Rabit
- Typing make in the root folder will compile the rabit library into the lib folder
- Add lib to the library path and include to the compiler's include path
- Languages: You can use rabit in C++ and Python
- It is also possible to port the library to other languages
Contributing
Rabit is an open-source library; contributions are welcome, including:
- The rabit core library.
- Customized tracker scripts for new platforms and interfaces for new languages.
- Tutorials and examples about the library.