[DOC] cleanup distributed training
This commit is contained in:
parent df7c7930d0
commit e7d8ed71d6

@@ -1,43 +1,30 @@
Change Log
==========
XGBoost Change Log
==================

xgboost-0.1
-----------
* Initial release
This file records the changes in the xgboost library in reverse chronological order.

xgboost-0.2x
------------
* Python module
* Weighted samples instances
* Initial version of pairwise rank
## brick: next release candidate
* Major refactor of core library.
  - Goal: more flexible and modular code as a portable library.
  - Switch to use of c++11 standard code.
  - Random number generator defaults to ```std::mt19937```.
  - Share the data loading pipeline and logging module from dmlc-core.
  - Enable registry pattern to allow optional plugin of objectives, metrics, tree constructors and data loaders.
  - Future plugin modules can be put into xgboost/plugin and registered back to the library.
  - Replace most of the raw pointers with smart pointers, for RAII safety.
* Change library name to libxgboost.so
* Backward compatibility
  - The binary buffer file is not backward compatible with previous versions.
  - The model file is backward compatible on 64 bit platforms.
* The model file is compatible between 64/32 bit platforms (not yet tested).
* External memory version and other advanced features will be exposed to the R library as well on linux.
  - Previously some of the features were blocked due to C++11 and threading limits.
  - The windows version is still blocked because Rtools does not support ```std::thread```.
* rabit and dmlc-core are maintained through git submodules
  - Anyone can open a PR to update these dependencies now.

xgboost-0.3
-----------
* Faster tree construction module
  - Allows subsample columns during tree construction via ```bst:col_samplebytree=ratio```
* Support for boosting from initial predictions
* Experimental version of LambdaRank
* Linear booster is now parallelized, using parallel coordinate descent.
* Add [Code Guide](src/README.md) for customizing objective function and evaluation
* Add R module
## v0.47 (2016.01.14)

xgboost-0.4
-----------
* Distributed version of xgboost that runs on YARN, scales to billions of examples
* Direct save/load data and model from/to S3 and HDFS
* Feature importance visualization in R module, by Michael Benesty
* Predict leaf index
* Poisson regression for counts data
* Early stopping option in training
* Native save load support in R and python
  - xgboost models now can be saved using save/load in R
  - xgboost python models are now picklable
* sklearn wrapper is supported in python module
* Experimental External memory version

xgboost-0.47
------------
* Changes in R library
  - fixed possible problem of poisson regression.
  - switched from 0 to NA for missing values.
@@ -58,23 +45,39 @@ xgboost-0.47
* Java API is ready for use
* Added more test cases and continuous integration to make each build more robust.

xgboost brick: next release candidate
-------------------------------------
* Major refactor of core library.
  - Goal: more flexible and modular code as a portable library.
  - Switch to use of c++11 standard code.
  - Random number generator defaults to ```std::mt19937```.
  - Share the data loading pipeline and logging module from dmlc-core.
  - Enable registry pattern to allow optional plugin of objectives, metrics, tree constructors and data loaders.
  - Future plugin modules can be put into xgboost/plugin and registered back to the library.
  - Replace most of the raw pointers with smart pointers, for RAII safety.
* Change library name to libxgboost.so
* Backward compatibility
  - The binary buffer file is not backward compatible with previous versions.
  - The model file is backward compatible on 64 bit platforms.
* The model file is compatible between 64/32 bit platforms (not yet tested).
* External memory version and other advanced features will be exposed to the R library as well on linux.
  - Previously some of the features were blocked due to C++11 and threading limits.
  - The windows version is still blocked because Rtools does not support ```std::thread```.
* rabit and dmlc-core are maintained through git submodules
  - Anyone can open a PR to update these dependencies now.
## v0.4 (2015.05.11)

* Distributed version of xgboost that runs on YARN, scales to billions of examples
* Direct save/load data and model from/to S3 and HDFS
* Feature importance visualization in R module, by Michael Benesty
* Predict leaf index
* Poisson regression for counts data
* Early stopping option in training
* Native save load support in R and python
  - xgboost models now can be saved using save/load in R
  - xgboost python models are now picklable
* sklearn wrapper is supported in python module
* Experimental External memory version

## v0.3 (2014.09.07)

* Faster tree construction module
  - Allows subsample columns during tree construction via ```bst:col_samplebytree=ratio```
* Support for boosting from initial predictions
* Experimental version of LambdaRank
* Linear booster is now parallelized, using parallel coordinate descent.
* Add [Code Guide](src/README.md) for customizing objective function and evaluation
* Add R module

## v0.2x (2014.05.20)

* Python module
* Weighted samples instances
* Initial version of pairwise rank

## v0.1 (2014.03.26)

* Initial release
18 README.md
@@ -15,23 +15,14 @@ XGBoost is part of [DMLC](http://dmlc.github.io/) projects.

Contents
--------
* [Documentation](https://xgboost.readthedocs.org)
* [Usecases](doc/index.md#highlight-links)
* [Documentation and Tutorials](https://xgboost.readthedocs.org)
* [Code Examples](demo)
* [Build Instruction](doc/build.md)
* [Committers and Contributors](CONTRIBUTORS.md)

What's New
----------
* XGBoost [brick](CHANGES.md)
* XGBoost helps Vlad Mironov, Alexander Guschin to win the [CERN LHCb experiment Flavour of Physics competition](https://www.kaggle.com/c/flavours-of-physics). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/11/30/flavour-of-physics-technical-write-up-1st-place-go-polar-bears/).
* XGBoost helps Mario Filho, Josef Feigl, Lucas, Gilberto to win the [Caterpillar Tube Pricing competition](https://www.kaggle.com/c/caterpillar-tube-pricing). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/09/22/caterpillar-winners-interview-1st-place-gilberto-josef-leustagos-mario/).
* XGBoost helps Halla Yang to win the [Recruit Coupon Purchase Prediction Challenge](https://www.kaggle.com/c/coupon-purchase-prediction). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/10/21/recruit-coupon-purchase-winners-interview-2nd-place-halla-yang/).

Version
-------
* Current version xgboost-0.6 (brick)
  - See [Change log](CHANGES.md) for details
* [XGBoost brick](NEWS.md) Release

Features
--------
@@ -45,17 +36,16 @@ Features

Bug Reporting
-------------

* For reporting bugs please use the [xgboost/issues](https://github.com/dmlc/xgboost/issues) page.
* For generic questions or to share your experience using xgboost please use the [XGBoost User Group](https://groups.google.com/forum/#!forum/xgboost-user/)

Contributing to XGBoost
-----------------------
XGBoost has been developed and used by a group of active community members. Everyone is more than welcome to contribute. It is a way to make the project better and more accessible to more users.
* Check out [Feature Wish List](https://github.com/dmlc/xgboost/labels/Wish-List) to see what can be improved, or open an issue if you want something.
* Contribute to the [documents and examples](https://github.com/dmlc/xgboost/blob/master/doc/) to share your experience with other users.
* Please add your name to [CONTRIBUTORS.md](CONTRIBUTORS.md) after your patch has been merged.
* Please add your name to [CONTRIBUTORS.md](CONTRIBUTORS.md) and after your patch has been merged.
  - Please also update [NEWS.md](NEWS.md) on changes and improvements in API and docs.

License
-------

@@ -44,8 +44,15 @@ However, the parameter settings can be applied to all versions

* [Multiclass classification](multiclass_classification)
* [Regression](regression)
* [Learning to Rank](rank)
* [Distributed Training](distributed-training)

Benchmarks
----------
* [Starter script for Kaggle Higgs Boson](kaggle-higgs)
* [Kaggle Tradeshift winning solution by daxiongshu](https://github.com/daxiongshu/kaggle-tradeshift-winning-solution)

Machine Learning Challenge Winning Solutions
--------------------------------------------
* XGBoost helps Vlad Mironov, Alexander Guschin to win the [CERN LHCb experiment Flavour of Physics competition](https://www.kaggle.com/c/flavours-of-physics). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/11/30/flavour-of-physics-technical-write-up-1st-place-go-polar-bears/).
* XGBoost helps Mario Filho, Josef Feigl, Lucas, Gilberto to win the [Caterpillar Tube Pricing competition](https://www.kaggle.com/c/caterpillar-tube-pricing). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/09/22/caterpillar-winners-interview-1st-place-gilberto-josef-leustagos-mario/).
* XGBoost helps Halla Yang to win the [Recruit Coupon Purchase Prediction Challenge](https://www.kaggle.com/c/coupon-purchase-prediction). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/10/21/recruit-coupon-purchase-winners-interview-2nd-place-halla-yang/).

52 demo/distributed-training/README.md (new file)
@@ -0,0 +1,52 @@

Distributed XGBoost Training
============================
This is a tutorial on distributed XGBoost training.
Currently xgboost supports distributed training via the CLI program with a configuration file.
There are also plans to push distributed python and other language bindings; please open an issue
if you are interested in contributing.

Build XGBoost with Distributed Filesystem Support
-------------------------------------------------
To use distributed xgboost, you only need to turn on the options for building
with distributed filesystems (HDFS or S3) in ```xgboost/make/config.mk```.
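
A minimal sketch of the relevant switches in ```make/config.mk``` (the option names below are a sketch from memory; check the comments in your copy of ```config.mk``` for the exact names), followed by a rebuild with ```make```:

```
# enable HDFS support (requires libhdfs and a JDK)
USE_HDFS = 1
# enable AWS S3 support (requires libcurl and openssl)
USE_S3 = 1
```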

How to Use
----------
* Input data format: LIBSVM format. The example here uses generated data in the ../data folder.
* Put the data into some distributed filesystem (S3 or HDFS)
* Use the tracker script in dmlc-core/tracker to submit the job (see the sketch below)
* Like all other DMLC tools, xgboost supports taking a path to a folder as the input argument
  - All the files in the folder will be used as input
* Quick start in Hadoop YARN: run ```bash run_yarn.sh <n_hadoop_workers> <n_thread_per_worker> <path_in_HDFS>```
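
A rough sketch of these steps, mirroring [run_yarn.sh](run_yarn.sh); the HDFS path and worker counts are placeholders:

```
# stage the demo data on HDFS (placeholder path)
hadoop fs -mkdir -p /tmp/xgb-demo/data
hadoop fs -put ../data/agaricus.txt.train /tmp/xgb-demo/data

# submit a 4-worker job through the dmlc-core YARN tracker
../../dmlc-core/tracker/dmlc_yarn.py -n 4 --vcores 4 ../../xgboost mushroom.hadoop.conf \
    nthread=4 data=hdfs:///tmp/xgb-demo/data/agaricus.txt.train \
    model_out=hdfs:///tmp/xgb-demo/mushroom.final.model
```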

Example
-------
* [run_yarn.sh](run_yarn.sh) shows how to submit a job to Hadoop via YARN.

Single machine vs Distributed Version
-------------------------------------
If you have used xgboost (single machine version) before, this section will show you how to run xgboost on hadoop with slight modifications to the conf file.
* IO: instead of reading and writing files locally, we now use HDFS; put the ```hdfs://``` prefix on the address of any file you want to access (see the sketch below)
* File cache: ```dmlc_yarn.py``` also provides several ways to cache necessary files, including the binary file (xgboost) and the conf file
  - ```dmlc_yarn.py``` will automatically cache files in the command line. For example, ```dmlc_yarn.py -n 3 $localPath/xgboost.dmlc mushroom.hadoop.conf``` will cache "xgboost.dmlc" and "mushroom.hadoop.conf".
  - You could also use "-f" to manually cache one or more files, like ```-f file1 -f file2```
  - The local path of cached files in the command is "./".
* More details of submission can be found in the usage of ```dmlc_yarn.py```.
* The model saved by the hadoop version is compatible with the single machine version.
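
For example, a conf file that reads local files on a single machine might change roughly as follows when moving to HDFS (a sketch; the paths are placeholders):

```
# single machine: local paths
# data = "../data/agaricus.txt.train"
# model_out = "mushroom.final.model"

# distributed: same keys, HDFS paths
data = "hdfs:///path-to-my-data/agaricus.txt.train"
model_out = "hdfs:///path-to-my-output/mushroom.final.model"
```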

Notes
-----
* The code is optimized with multi-threading, so you will want to run xgboost with more vcores for best performance.
  - You will want to set <n_thread_per_worker> to be the number of cores you have on each machine.

External Memory Version
-----------------------
XGBoost supports external memory; this makes each process cache data to local disk during computation, without taking up all the memory for storing the data.
See [external memory](https://github.com/dmlc/xgboost/tree/master/doc/external_memory.md) for the syntax of using external memory.

You only need to add a cache prefix to the input file to enable external memory mode. For example, set the training data as
```
data=hdfs:///path-to-my-data/#dtrain.cache
```
This will make xgboost more memory efficient and allows you to run xgboost on larger-scale datasets.

33 demo/distributed-training/run_yarn.sh (executable file)
@@ -0,0 +1,33 @@

#!/bin/bash
if [ "$#" -lt 3 ];
then
    echo "Usage: <nworkers> <nthreads> <path_in_HDFS>"
    exit -1
fi

# put the local training files into HDFS
hadoop fs -mkdir $3/data
hadoop fs -put ../data/agaricus.txt.train $3/data
hadoop fs -put ../data/agaricus.txt.test $3/data

# run rabit via the YARN tracker, passing HDFS addresses for data and model
../../dmlc-core/tracker/dmlc_yarn.py -n $1 --vcores $2 ../../xgboost mushroom.hadoop.conf nthread=$2\
    data=hdfs://$3/data/agaricus.txt.train\
    eval[test]=hdfs://$3/data/agaricus.txt.test\
    model_out=hdfs://$3/mushroom.final.model

# get the final model file
hadoop fs -get $3/mushroom.final.model final.model

# use dmlc-core/yarn/run_hdfs_prog.py to set up the appropriate env

# output prediction task=pred
#../../xgboost.dmlc mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test
../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test
# print the boosters of final.model in dump.raw.txt
#../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
# use the feature map in printing for better visualization
#../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
cat dump.nice.txt
@@ -1,28 +0,0 @@
Distributed XGBoost
======
Distributed XGBoost is now part of [Wormhole](https://github.com/dmlc/wormhole).
Checkout this [Link](https://github.com/dmlc/wormhole/tree/master/learn/xgboost) for usage examples, build and job submissions.
* The distributed version is built on Rabit: [Reliable Allreduce and Broadcast Library](https://github.com/dmlc/rabit)
  - Rabit is a portable library that provides fault-tolerance for Allreduce calls for distributed machine learning
  - This makes xgboost portable and fault-tolerant against node failures

Notes
====
* Rabit handles all the fault tolerance and communication efficiently; we only use platform specific commands to start programs
  - The Hadoop version does not rely on MapReduce to do iterations
  - You can expect xgboost not to suffer the drawbacks of iterative MapReduce programs
* The design choice was made because Allreduce is very natural and efficient for distributed tree building
  - In the current version of xgboost, the distributed version only adds several lines of Allreduce synchronization code
* The multi-threading nature of xgboost is inherited in distributed mode
  - This means xgboost efficiently uses all the threads in one machine, and communicates only between machines
  - Remember to run one xgboost process per machine; this will give you maximum speedup
* For more information about rabit and how it works, see the [Rabit Tutorial](https://github.com/dmlc/rabit/tree/master/guide)

Solvers
=====
* The column-based solver splits data by column; each node works on a subset of columns and
  uses exactly the same algorithm as the single node version.
* The row-based solver splits data by row; each node works on a subset of rows.
  It uses an approximate histogram count algorithm, and will only examine a subset of
  potential split points as opposed to all split points.
  - This is the mode used by the current hadoop version, since data is usually stored by rows in many industry systems
@@ -1,19 +0,0 @@
Distributed XGBoost: Column Split Version
====
* run ```bash mushroom-col-rabit.sh <n-process>```
  - mushroom-col-rabit.sh starts an xgboost job using rabit's allreduce
* run ```bash mushroom-col-rabit-mock.sh <n-process>```
  - mushroom-col-rabit-mock.sh starts an xgboost job using rabit's allreduce, inserts a suicide signal at a certain point and tests recovery

How to Use
====
* First split the data by column.
* In the config, specify the data file with a wildcard %d, where %d is the rank of the node; each node will load its own part of the data
* Enable column split mode by ```dsplit=col``` (see the sketch below)
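
A minimal sketch of the relevant pieces, matching the ```mushroom-col.conf``` and scripts removed later in this commit:

```
# in mushroom-col.conf: %d is replaced by the rank of the process
data = "train.col%d"

# on the command line: enable column split when launching through the rabit tracker
# ../../subtree/rabit/tracker/rabit_demo.py -n <n-process> ../../xgboost mushroom-col.conf dsplit=col
```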

Notes
====
* The code is multi-threaded, so you want to run one process per node
* The code will work correctly as long as the union of the column subsets covers all the columns we are interested in.
  - The column subsets can overlap with each other.
* It uses exactly the same algorithm as the single node version, examining all potential split points.
@@ -1,25 +0,0 @@
#!/bin/bash
if [[ $# -ne 1 ]]
then
    echo "Usage: nprocess"
    exit -1
fi

#
# This script is the same as mushroom-col except that we use xgboost instead of xgboost-mpi
# xgboost uses the built-in TCP-based allreduce module, and can run in more environments, as long as we know how to start the job by modifying ../submit_job_tcp.py
#
rm -rf train.col* *.model
k=$1

# split the libsvm file into k subfiles
python splitsvm.py ../../demo/data/agaricus.txt.train train $k

# run xgboost mpi
../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost.mock mushroom-col.conf dsplit=col mock=0,2,0,0 mock=1,2,0,0 mock=2,2,8,0 mock=2,3,0,0

# the model can be directly loaded by the single machine xgboost solver, as usual
#../../xgboost mushroom-col.conf task=dump model_in=0002.model fmap=../../demo/data/featmap.txt name_dump=dump.nice.$k.txt

#cat dump.nice.$k.txt
@@ -1,28 +0,0 @@
#!/bin/bash
if [[ $# -ne 1 ]]
then
    echo "Usage: nprocess"
    exit -1
fi

#
# This script is the same as mushroom-col except that we use xgboost instead of xgboost-mpi
# xgboost uses the built-in TCP-based allreduce module, and can run in more environments, as long as we know how to start the job by modifying ../submit_job_tcp.py
#
rm -rf train.col* *.model
k=$1

# split the libsvm file into k subfiles
python splitsvm.py ../../demo/data/agaricus.txt.train train $k

# run xgboost mpi
../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost mushroom-col.conf dsplit=col

# the model can be directly loaded by the single machine xgboost solver, as usual
../../xgboost mushroom-col.conf task=dump model_in=0002.model fmap=../../demo/data/featmap.txt name_dump=dump.nice.$k.txt

# run for one round, and continue training
../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost mushroom-col.conf dsplit=col num_round=1
../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost mushroom-col.conf dsplit=col model_in=0001.model

cat dump.nice.$k.txt
@@ -1,35 +0,0 @@
# General Parameters, see comment for each definition
# choose the booster, can be gbtree or gblinear
booster = gbtree
# choose logistic regression loss function for binary classification
objective = binary:logistic

# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight (hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3

# Task Parameters
# the number of rounds to do boosting
num_round = 2
# 0 means do not save any model except the final round model
save_period = 0
use_buffer = 0

# The path of training data; %d is the wildcard for the rank of the process
# The idea is that each process takes a feature matrix with a subset of the columns
#
data = "train.col%d"

# The path of validation data, used to monitor the training process; here [test] sets the name of the validation set
eval[test] = "../../demo/data/agaricus.txt.test"
# evaluate on training data as well each round
eval_train = 1

# The path of test data; it needs the full test data, so try not to use it, or keep a subsampled version
test:data = "../../demo/data/agaricus.txt.test"
@@ -1,32 +0,0 @@
#!/usr/bin/python
import sys
import random

# split a libsvm file into k column subsets, one output file per subset
if len(sys.argv) < 4:
    print ('Usage: <fin> <fo> k')
    exit(0)

random.seed(10)
# maps feature id -> index of the output file that owns that column
fmap = {}

k = int(sys.argv[3])
fos = []

# open one output file per column subset
for i in range(k):
    fos.append(open(sys.argv[2] + '.col%d' % i, 'w'))

for l in open(sys.argv[1]):
    arr = l.split()
    # write the label to every output file
    for f in fos:
        f.write(arr[0])
    # assign each feature to a random output file, consistently across rows
    for it in arr[1:]:
        fid = int(it.split(':')[0])
        if fid not in fmap:
            fmap[fid] = random.randint(0, k-1)
        fos[fmap[fid]].write(' ' + it)
    for f in fos:
        f.write('\n')
for f in fos:
    f.close()