* Initial commit to support multi-node multi-gpu xgboost using dask
* Fixed NCCL initialization by not ignoring the opg parameter.
  - it now crashes on NCCL initialization, but at least we're attempting it properly
* At the root node, perform a rabit::Allreduce to get the initial sum_gradient across workers
* Synchronizing in a couple more places.
  - now the workers don't go down, but just hang
  - no more "wild" values of gradients
  - probably needs syncing in more places
* Added another missing max-allreduce operation inside BuildHistLeftRight
* Removed unnecessary collective operations.
* Simplified rabit::Allreduce() sync of gradient sums.
* Removed unnecessary rabit syncs around ncclAllReduce.
  - this improves performance _significantly_ (7x faster for overall training, 20x faster for xgboost proper)
* Pulling in latest xgboost
* Removing changes to updater_quantile_hist.cc
* Changing use_nccl_opg initialization, removing unnecessary if statements
* Added definition for opaque ncclUniqueId struct to properly encapsulate GetUniqueId
* Placing struct definition in guard to avoid duplicate code errors
* Addressing linting errors
* Removing
* Removing additional arguments to AllReducer initialization
* Removing distributed flag
* Making comm init symmetric
* Removing distributed flag
* Changing ncclCommInit to support multiple modalities
* Fix indenting
* Updating ncclCommInitRank block with necessary group calls
* Fix indenting
* Adding print statement, and updating accessor in vector
* Improving print statement to end the line
* Generalizing nccl_rank construction using rabit
* Assume device_ordinals is the same for every node
* Test, assume device_ordinals is identical for all nodes
* Test, assume device_ordinals is unique for all nodes
* Changing name of offset variable to be more descriptive, editing indenting
* Wrapping ncclUniqueId GetUniqueId() and aesthetic changes
* Adding synchronization, and tests for distributed
* Adding to tests
* Fixing broken #endif
* Fixing initialization of GPU histograms, correcting errors in tests
* Adding to contributors list
* Adding distributed tests to Jenkins
* Fixing bad path in distributed test
* Debugging
* Adding Kubernetes for distributed tests
* Adding proper import for OrderedDict
* Adding urllib3==1.22 to address ordered_dict import error
* Added sleep to allow workers to save their models for comparison
* Adding name to GPU contributors under docs
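Several of the items above (broadcasting the ncclUniqueId, wrapping ncclCommInitRank in the necessary group calls, and generalizing nccl_rank construction from the rabit rank under the assumption that device_ordinals is the same on every node) describe one initialization pattern. The sketch below is illustrative only, not the actual xgboost source; the names InitNcclComms and SyncGradientSums, and the omitted error handling, are assumptions.

```cpp
// Hypothetical sketch of multi-node, multi-GPU NCCL communicator setup on top
// of rabit, along the lines described in the commit log above.
#include <cuda_runtime.h>
#include <nccl.h>
#include <rabit/rabit.h>
#include <vector>

std::vector<ncclComm_t> InitNcclComms(const std::vector<int>& device_ordinals) {
  int n_local = static_cast<int>(device_ordinals.size());
  // Assume every node drives the same number of GPUs, so the global NCCL rank
  // is the rabit rank times the per-node GPU count plus a local offset.
  int n_total = rabit::GetWorldSize() * n_local;
  int offset  = rabit::GetRank() * n_local;

  // The root worker generates the unique id; all other workers receive it.
  ncclUniqueId id;
  if (rabit::GetRank() == 0) {
    ncclGetUniqueId(&id);
  }
  rabit::Broadcast(&id, sizeof(id), 0);

  // One communicator per local GPU; group the per-device init calls so a
  // single process can participate with several ranks.
  std::vector<ncclComm_t> comms(n_local);
  ncclGroupStart();
  for (int i = 0; i < n_local; ++i) {
    cudaSetDevice(device_ordinals[i]);
    ncclCommInitRank(&comms[i], n_total, id, offset + i);
  }
  ncclGroupEnd();
  return comms;
}

// Hypothetical helper: the "rabit::Allreduce() sync of gradient sums" item
// amounts to summing the local (grad, hess) totals across all workers.
void SyncGradientSums(double* sum_grad, double* sum_hess) {
  double buf[2] = {*sum_grad, *sum_hess};
  rabit::Allreduce<rabit::op::Sum>(buf, 2);
  *sum_grad = buf[0];
  *sum_hess = buf[1];
}
```

In this sketch the CPU-side rabit allreduce reconciles the root-node gradient sums across workers, while the per-GPU NCCL communicators handle the device-side histogram reductions; return codes from the NCCL and CUDA calls are ignored for brevity.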
Contributors of DMLC/XGBoost
XGBoost has been developed and used by an active community. Everyone is more than welcome to contribute; it is a great way to make the project better and more accessible to more users.
Committers
Committers are people who have made substantial contributions to the project and have been granted write access to the project.
- Tianqi Chen, University of Washington
- Tianqi is a Ph.D. student working on large-scale machine learning. He is the creator of the project.
- Tong He, Amazon AI
- Tong is an applied scientist at Amazon AI. He is the maintainer of the XGBoost R package.
- Vadim Khotilovich
- Vadim has contributed many improvements to the R and core packages.
- Bing Xu
- Bing is the original creator of the XGBoost Python package and is currently the maintainer of XGBoost.jl.
- Michael Benesty
- Michael is a lawyer and data scientist in France. He is the creator of the XGBoost interactive analysis module in R.
- Yuan Tang, Ant Financial
- Yuan is a software engineer at Ant Financial. He has contributed mostly to the R and Python packages.
- Nan Zhu, Uber
- Nan is a software engineer at Uber. He has contributed mostly to the JVM packages.
- Sergei Lebedev, Criteo
- Sergei is a software engineer at Criteo. He has contributed mostly to the JVM packages.
- Hongliang Liu
- Scott Lundberg, University of Washington
- Scott is a Ph.D. student at the University of Washington. He is the creator of SHAP, a unified approach to explaining the output of machine learning models such as decision tree ensembles. He also helps maintain the XGBoost Julia package.
- Rory Mitchell, University of Waikato
- Rory is a Ph.D. student at the University of Waikato. He is the original creator of the GPU training algorithms. He improved the CMake build system and continuous integration.
- Hyunsu Cho, Amazon AI
- Hyunsu is an applied scientist at Amazon AI. He is the maintainer of the XGBoost Python package. He also manages the Jenkins continuous integration system (https://xgboost-ci.net/). He is the initial author of the CPU 'hist' updater.
- Jiaming
- Jiaming contributed to the GPU algorithms. He has also introduced new abstractions to improve the quality of the C++ codebase.
Become a Committer
XGBoost is an open-source project, and we are actively looking for new committers who are willing to help maintain and lead the project. Committers come from contributors who:
- Have made substantial contributions to the project.
- Are willing to spend time maintaining and leading the project.
New committers will be proposed by current committers, with support from more than two current committers.
List of Contributors
- Full List of Contributors
- To contributors: please add your name to the list when you submit a patch to the project. :)
- Kailong Chen
- Kailong is an early contributor to XGBoost and the creator of the ranking objectives in XGBoost.
- Skipper Seabold
- Skipper is the major contributor to the scikit-learn module of XGBoost.
- Zygmunt Zając
- Zygmunt is the master behind the early stopping feature frequently used by kagglers.
- Ajinkya Kale
- Boliang Chen
- Yangqing Men
- Yangqing is the creator of the XGBoost Java package.
- Engpeng Yao
- Giulio
- Giulio is the creator of the Windows project of XGBoost.
- Jamie Hall
- Jamie is the initial creator of the XGBoost scikit-learn module.
- Yen-Ying Lee
- Masaaki Horikoshi
- Masaaki is the initial creator of the XGBoost Python plotting module.
- daiyl0320
- daiyl0320 contributed a patch that makes the XGBoost distributed version more robust and scale stably on TB-scale datasets.
- Huayi Zhang
- Johan Manders
- yoori
- Mathias Müller
- Sam Thomson
- ganesh-krishnan
- Damien Carol
- Alex Bain
- Baltazar Bieniek
- Adam Pocock
- Gideon Whitehead
- Yi-Lin Juang
- Andrew Hannigan
- Andy Adinets
- Henry Gouk
- Pierre de Sahb
- liuliang01
- liuliang01 added support for the qid column in the LibSVM input format, which makes ranking tasks easier in a distributed setting.
- Andrew Thia
- Andrew Thia implemented feature interaction constraints.
- Wei Tian
- Chen Qin
- Sam Wilkinson
- Matthew Jones