[DOC] cleanup distributed training

This commit is contained in:
tqchen 2016-01-16 11:00:06 -08:00
parent df7c7930d0
commit e7d8ed71d6
11 changed files with 155 additions and 237 deletions

View File

@ -1,43 +1,30 @@
-Change Log
-==========
+XGBoost Change Log
+==================
+This file records the changes in xgboost library in reverse chronological order.
-xgboost-0.1
------------
-* Initial release
-xgboost-0.2x
-------------
-* Python module
-* Weighted samples instances
-* Initial version of pairwise rank
+## brick: next release candidate
+* Major refactor of core library.
+  - Goal: more flexible and modular code as a portable library.
+  - Switch to use of c++11 standard code.
+  - Random number generator defaults to ```std::mt19937```.
+  - Share the data loading pipeline and logging module from dmlc-core.
+  - Enable registry pattern to allow optionally plugin of objective, metric, tree constructor, data loader.
+  - Future plugin modules can be put into xgboost/plugin and register back to the library.
+  - Remove most of the raw pointers to smart ptrs, for RAII safety.
+* Change library name to libxgboost.so
+* Backward compatiblity
+  - The binary buffer file is not backward compatible with previous version.
+  - The model file is backward compatible on 64 bit platforms.
+* The model file is compatible between 64/32 bit platforms(not yet tested).
+* External memory version and other advanced features will be exposed to R library as well on linux.
+  - Previously some of the features are blocked due to C++11 and threading limits.
+  - The windows version is still blocked due to Rtools do not support ```std::thread```.
+* rabit and dmlc-core are maintained through git submodule
+  - Anyone can open PR to update these dependencies now.
-xgboost-0.3
------------
-* Faster tree construction module
-  - Allows subsample columns during tree construction via ```bst:col_samplebytree=ratio```
-* Support for boosting from initial predictions
-* Experimental version of LambdaRank
-* Linear booster is now parallelized, using parallel coordinated descent.
-* Add [Code Guide](src/README.md) for customizing objective function and evaluation
-* Add R module
-xgboost-0.4
------------
-* Distributed version of xgboost that runs on YARN, scales to billions of examples
-* Direct save/load data and model from/to S3 and HDFS
-* Feature importance visualization in R module, by Michael Benesty
-* Predict leaf index
-* Poisson regression for counts data
-* Early stopping option in training
-* Native save load support in R and python
-  - xgboost models now can be saved using save/load in R
-  - xgboost python model is now pickable
-* sklearn wrapper is supported in python module
-* Experimental External memory version
-xgboost-0.47
-------------
+## v0.47 (2016.01.14)
* Changes in R library
  - fixed possible problem of poisson regression.
  - switched from 0 to NA for missing values.
@ -58,23 +45,39 @@ xgboost-0.47
* Java api is ready for use
* Added more test cases and continuous integration to make each build more robust.
-xgboost brick: next release candidate
--------------------------------------
-* Major refactor of core library.
-  - Goal: more flexible and modular code as a portable library.
-  - Switch to use of c++11 standard code.
-  - Random number generator defaults to ```std::mt19937```.
-  - Share the data loading pipeline and logging module from dmlc-core.
-  - Enable registry pattern to allow optionally plugin of objective, metric, tree constructor, data loader.
-  - Future plugin modules can be put into xgboost/plugin and register back to the library.
-  - Remove most of the raw pointers to smart ptrs, for RAII safety.
-* Change library name to libxgboost.so
-* Backward compatiblity
-  - The binary buffer file is not backward compatible with previous version.
-  - The model file is backward compatible on 64 bit platforms.
-* The model file is compatible between 64/32 bit platforms(not yet tested).
-* External memory version and other advanced features will be exposed to R library as well on linux.
-  - Previously some of the features are blocked due to C++11 and threading limits.
-  - The windows version is still blocked due to Rtools do not support ```std::thread```.
-* rabit and dmlc-core are maintained through git submodule
-  - Anyone can open PR to update these dependencies now.
+## v0.4 (2015.05.11)
+* Distributed version of xgboost that runs on YARN, scales to billions of examples
+* Direct save/load data and model from/to S3 and HDFS
+* Feature importance visualization in R module, by Michael Benesty
+* Predict leaf index
+* Poisson regression for counts data
+* Early stopping option in training
+* Native save load support in R and python
+  - xgboost models now can be saved using save/load in R
+  - xgboost python model is now pickable
+* sklearn wrapper is supported in python module
+* Experimental External memory version
+## v0.3 (2014.09.07)
+* Faster tree construction module
+  - Allows subsample columns during tree construction via ```bst:col_samplebytree=ratio```
+* Support for boosting from initial predictions
+* Experimental version of LambdaRank
+* Linear booster is now parallelized, using parallel coordinated descent.
+* Add [Code Guide](src/README.md) for customizing objective function and evaluation
+* Add R module
+## v0.2x (2014.05.20)
+* Python module
+* Weighted samples instances
+* Initial version of pairwise rank
+## v0.1 (2014.03.26)
+* Initial release

View File

@ -15,23 +15,14 @@ XGBoost is part of [DMLC](http://dmlc.github.io/) projects.
Contents
--------
-* [Documentation](https://xgboost.readthedocs.org)
-* [Usecases](doc/index.md#highlight-links)
+* [Documentation and Tutorials](https://xgboost.readthedocs.org)
* [Code Examples](demo)
* [Build Instruction](doc/build.md)
* [Committers and Contributors](CONTRIBUTORS.md)
What's New
----------
-* XGBoost [brick](CHANGES.md)
-* XGBoost helps Vlad Mironov, Alexander Guschin to win the [CERN LHCb experiment Flavour of Physics competition](https://www.kaggle.com/c/flavours-of-physics). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/11/30/flavour-of-physics-technical-write-up-1st-place-go-polar-bears/).
-* XGBoost helps Mario Filho, Josef Feigl, Lucas, Gilberto to win the [Caterpillar Tube Pricing competition](https://www.kaggle.com/c/caterpillar-tube-pricing). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/09/22/caterpillar-winners-interview-1st-place-gilberto-josef-leustagos-mario/).
-* XGBoost helps Halla Yang to win the [Recruit Coupon Purchase Prediction Challenge](https://www.kaggle.com/c/coupon-purchase-prediction). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/10/21/recruit-coupon-purchase-winners-interview-2nd-place-halla-yang/).
-Version
--------
-* Current version xgboost-0.6 (brick)
-  - See [Change log](CHANGES.md) for details
+* [XGBoost brick](NEWS.md) Release
Features
--------
@ -45,17 +36,16 @@ Features
Bug Reporting
-------------
* For reporting bugs please use the [xgboost/issues](https://github.com/dmlc/xgboost/issues) page.
* For generic questions or to share your experience using xgboost please use the [XGBoost User Group](https://groups.google.com/forum/#!forum/xgboost-user/)
Contributing to XGBoost
-----------------------
XGBoost has been developed and used by a group of active community members. Everyone is more than welcome to contribute. It is a way to make the project better and more accessible to more users.
* Check out [Feature Wish List](https://github.com/dmlc/xgboost/labels/Wish-List) to see what can be improved, or open an issue if you want something.
* Contribute to the [documents and examples](https://github.com/dmlc/xgboost/blob/master/doc/) to share your experience with other users.
-* Please add your name to [CONTRIBUTORS.md](CONTRIBUTORS.md) after your patch has been merged.
+* Please add your name to [CONTRIBUTORS.md](CONTRIBUTORS.md) after your patch has been merged.
+  - Please also update [NEWS.md](NEWS.md) on changes and improvements in API and docs.
License
-------

View File

@ -44,8 +44,15 @@ However, the parameter settings can be applied to all versions
* [Multiclass classification](multiclass_classification)
* [Regression](regression)
* [Learning to Rank](rank)
+* [Distributed Training](distributed-training)
Benchmarks
----------
* [Starter script for Kaggle Higgs Boson](kaggle-higgs)
* [Kaggle Tradeshift winning solution by daxiongshu](https://github.com/daxiongshu/kaggle-tradeshift-winning-solution)
+Machine Learning Challenge Winning Solutions
+--------------------------------------------
+* XGBoost helps Vlad Mironov, Alexander Guschin to win the [CERN LHCb experiment Flavour of Physics competition](https://www.kaggle.com/c/flavours-of-physics). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/11/30/flavour-of-physics-technical-write-up-1st-place-go-polar-bears/).
+* XGBoost helps Mario Filho, Josef Feigl, Lucas, Gilberto to win the [Caterpillar Tube Pricing competition](https://www.kaggle.com/c/caterpillar-tube-pricing). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/09/22/caterpillar-winners-interview-1st-place-gilberto-josef-leustagos-mario/).
+* XGBoost helps Halla Yang to win the [Recruit Coupon Purchase Prediction Challenge](https://www.kaggle.com/c/coupon-purchase-prediction). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/10/21/recruit-coupon-purchase-winners-interview-2nd-place-halla-yang/).

View File

@ -0,0 +1,52 @@
Distributed XGBoost Training
============================
This is a tutorial on distributed XGBoost training.
Currently xgboost supports distributed training via the CLI program with a configuration file.
There are also plans to add distributed Python and other language bindings; please open an issue
if you are interested in contributing.
Build XGBoost with Distributed Filesystem Support
-------------------------------------------------
To use distributed xgboost, you only need to turn on the options for building
with distributed filesystems (HDFS or S3) in ```xgboost/make/config.mk```.
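For example, a minimal sketch of such a build, assuming the template at ```make/config.mk``` exposes ```USE_HDFS``` and ```USE_S3``` switches (check your copy of the file for the exact option names):
```
cd xgboost
# copy the config template and turn on the distributed filesystem options
cp make/config.mk config.mk
sed -i 's/^USE_HDFS.*/USE_HDFS = 1/;s/^USE_S3.*/USE_S3 = 1/' config.mk
# rebuild the CLI binary with HDFS/S3 support
make -j4
```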
How to Use
----------
* Input data format: LIBSVM format. The example here uses generated data in the ../data folder.
* Put the data into some distributed filesystem (S3 or HDFS)
* Use the tracker script in dmlc-core/tracker to submit the jobs
* Like all other DMLC tools, xgboost supports taking a path to a folder as an input argument
  - All the files in the folder will be used as input
* Quick start on Hadoop YARN: run ```bash run_yarn.sh <n_hadoop_workers> <n_thread_per_worker> <path_in_HDFS>```
Example
-------
* [run_yarn.sh](run_yarn.sh) shows how to submit job to Hadoop via YARN.
Single machine vs Distributed Version
-------------------------------------
If you have used xgboost (single machine version) before, this section will show you how to run xgboost on Hadoop with a slight modification to the conf file.
* IO: instead of reading and writing files locally, we now use HDFS; add the ```hdfs://``` prefix to the address of any file you want to access
* File cache: ```dmlc_yarn.py``` also provides several ways to cache the necessary files, including the binary (xgboost) and the conf file
  - ```dmlc_yarn.py``` will automatically cache files given on the command line. For example, ```dmlc_yarn.py -n 3 $localPath/xgboost.dmlc mushroom.hadoop.conf``` will cache "xgboost.dmlc" and "mushroom.hadoop.conf".
  - You could also use "-f" to manually cache one or more files, like ```-f file1 -f file2```
  - The local path of cached files inside the job is "./".
* For more details on submission, refer to the usage of ```dmlc_yarn.py```; a full command is sketched after this list.
* The model saved by the Hadoop version is compatible with the single machine version.
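Putting the pieces together, a direct submission might look like the following sketch (worker counts and HDFS paths are hypothetical; [run_yarn.sh](run_yarn.sh) is the working reference):
```
# submit 3 workers with 4 vcores each; the xgboost binary and the conf file on the
# command line are cached automatically, featmap.txt is cached explicitly via -f
../../dmlc-core/tracker/dmlc_yarn.py -n 3 --vcores 4 \
  -f ../data/featmap.txt \
  ../../xgboost mushroom.hadoop.conf nthread=4 \
  data=hdfs:///user/me/xgboost-demo/data/agaricus.txt.train \
  eval[test]=hdfs:///user/me/xgboost-demo/data/agaricus.txt.test \
  model_out=hdfs:///user/me/xgboost-demo/mushroom.final.model
```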
Notes
-----
* The code is optimized with multi-threading, so you will want to run xgboost with more vcores per worker for best performance.
  - You will want to set <n_thread_per_worker> to the number of cores you have on each machine.
External Memory Version
-----------------------
XGBoost supports external memory; this makes each process cache data to local disk during computation, instead of keeping the entire dataset in memory.
See [external memory](https://github.com/dmlc/xgboost/tree/master/doc/external_memory.md) for the syntax of using external memory.
You only need to add a cache prefix to the input file path to enable external memory mode. For example, set the training data as
```
data=hdfs:///path-to-my-data/#dtrain.cache
```
This makes xgboost more memory efficient and allows you to run it on larger-scale datasets.
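Building on the run_yarn.sh example, an external-memory submission might look like this sketch (hypothetical paths; appending the ```#dtrain.cache``` suffix is the only change):
```
../../dmlc-core/tracker/dmlc_yarn.py -n 4 --vcores 4 \
  ../../xgboost mushroom.hadoop.conf nthread=4 \
  data=hdfs:///user/me/xgboost-demo/data/#dtrain.cache \
  model_out=hdfs:///user/me/xgboost-demo/mushroom.final.model
```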

View File

@ -0,0 +1,33 @@
#!/bin/bash
if [ "$#" -lt 3 ];
then
echo "Usage: <nworkers> <nthreads> <path_in_HDFS>"
exit -1
fi
# put the local training file to HDFS
hadoop fs -mkdir $3/data
hadoop fs -put ../data/agaricus.txt.train $3/data
hadoop fs -put ../data/agaricus.txt.test $3/data
# running rabit, pass address in hdfs
../../dmlc-core/tracker/dmlc_yarn.py -n $1 --vcores $2 ../../xgboost mushroom.hadoop.conf nthread=$2 \
    data=hdfs://$3/data/agaricus.txt.train \
    eval[test]=hdfs://$3/data/agaricus.txt.test \
    model_out=hdfs://$3/mushroom.final.model
# get the final model file
hadoop fs -get $3/mushroom.final.model final.model
# use dmlc-core/yarn/run_hdfs_prog.py to set up the appropriate env
# output prediction task=pred
#../../xgboost.dmlc mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test
../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test
# print the boosters of final.model in dump.raw.txt
#../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
# use the feature map in printing for better visualization
#../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
cat dump.nice.txt

View File

@ -1,28 +0,0 @@
Distributed XGBoost
======
Distributed XGBoost is now part of [Wormhole](https://github.com/dmlc/wormhole).
Checkout this [Link](https://github.com/dmlc/wormhole/tree/master/learn/xgboost) for usage examples, build and job submissions.
* The distributed version is built on Rabit: [Reliable Allreduce and Broadcast Library](https://github.com/dmlc/rabit)
- Rabit is a portable library that provides fault-tolerance for Allreduce calls for distributed machine learning
- This makes xgboost portable and fault-tolerant against node failures
Notes
====
* Rabit handles fault tolerance and communication efficiently; we only use platform-specific commands to start the programs
  - The Hadoop version does not rely on MapReduce to do iterations
  - So xgboost does not suffer from the drawbacks of iterative MapReduce programs
* The design choice was made because Allreduce is very natural and efficient for distributed tree building
  - In the current version of xgboost, the distributed version only adds several lines of Allreduce synchronization code
* The multi-threading nature of xgboost is inherited in distributed mode
  - This means xgboost efficiently uses all the threads on one machine, and communicates only between machines
  - Remember to run one xgboost process per machine; this will give you maximum speedup
* For more information about rabit and how it works, see the [Rabit's Tutorial](https://github.com/dmlc/rabit/tree/master/guide)
Solvers
=====
* The column-based solver splits data by column; each node works on a subset of columns,
  and it uses exactly the same algorithm as the single node version.
* The row-based solver splits data by row; each node works on a subset of rows,
  using an approximate histogram count algorithm that only examines a subset of
  potential split points as opposed to all split points (see the sketch after this list).
  - This is the mode used by the current Hadoop version, since data is usually stored by rows in many industry systems
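As an illustration, the solver is selected at submission time through the ```dsplit``` parameter; a sketch with hypothetical worker counts and conf files, assuming ```dsplit=row``` selects the row-based mode:
```
# column-based solver: each process loads its own column slice (see the col-split example)
../../subtree/rabit/tracker/rabit_demo.py -n 3 ../../xgboost mushroom-col.conf dsplit=col
# row-based solver: the mode used by the Hadoop/YARN version
../../dmlc-core/tracker/dmlc_yarn.py -n 3 ../../xgboost mushroom.hadoop.conf dsplit=row
```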

View File

@ -1,19 +0,0 @@
Distributed XGBoost: Column Split Version
====
* run ```bash mushroom-col-rabit.sh <n-process>```
  - mushroom-col-rabit.sh starts an xgboost job using rabit's allreduce
* run ```bash mushroom-col-rabit-mock.sh <n-process>```
  - mushroom-col-rabit-mock.sh starts an xgboost job using rabit's allreduce, inserts a suicide signal at a certain point and tests recovery
How to Use
====
* First split the data by column.
* In the config, specify the data file with a wildcard %d, where %d is the rank of the node; each node will load its own part of the data
* Enable column split mode by ```dsplit=col``` (see the sketch after this list)
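A minimal sketch of how the pieces fit together, assuming three column slices named train.col0 … train.col2 (matching the mushroom-col.conf shown further below):
```
# the conf points at the per-rank slice: data = "train.col%d"
# each rank substitutes its own id into %d and loads only its columns
../../subtree/rabit/tracker/rabit_demo.py -n 3 ../../xgboost mushroom-col.conf dsplit=col
```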
Notes
====
* The code is multi-threaded, so you want to run one process per node
* The code will work correctly as long as union of each column subset is all the columns we are interested in.
- The column subset can overlap with each other.
* It uses exactly the same algorithm as single node version, to examine all potential split points.

View File

@ -1,25 +0,0 @@
#!/bin/bash
if [[ $# -ne 1 ]]
then
echo "Usage: nprocess"
exit -1
fi
#
# This script is the same as mushroom-col except that we use xgboost instead of xgboost-mpi
# xgboost uses the built-in TCP-based allreduce module, and can run in more environments, as long as we know how to start the job by modifying ../submit_job_tcp.py
#
rm -rf train.col* *.model
k=$1
# split the lib svm file into k subfiles
python splitsvm.py ../../demo/data/agaricus.txt.train train $k
# run xgboost.mock via the rabit tracker, injecting failures at the points given by the mock arguments
../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost.mock mushroom-col.conf dsplit=col mock=0,2,0,0 mock=1,2,0,0 mock=2,2,8,0 mock=2,3,0,0
# the model can be directly loaded by the single machine xgboost solver, as usual
#../../xgboost mushroom-col.conf task=dump model_in=0002.model fmap=../../demo/data/featmap.txt name_dump=dump.nice.$k.txt
#cat dump.nice.$k.txt

View File

@ -1,28 +0,0 @@
#!/bin/bash
if [[ $# -ne 1 ]]
then
echo "Usage: nprocess"
exit -1
fi
#
# This script is the same as mushroom-col except that we use xgboost instead of xgboost-mpi
# xgboost uses the built-in TCP-based allreduce module, and can run in more environments, as long as we know how to start the job by modifying ../submit_job_tcp.py
#
rm -rf train.col* *.model
k=$1
# split the lib svm file into k subfiles
python splitsvm.py ../../demo/data/agaricus.txt.train train $k
# run xgboost via the rabit tracker
../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost mushroom-col.conf dsplit=col
# the model can be directly loaded by the single machine xgboost solver, as usual
../../xgboost mushroom-col.conf task=dump model_in=0002.model fmap=../../demo/data/featmap.txt name_dump=dump.nice.$k.txt
# run for one round, and continue training
../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost mushroom-col.conf dsplit=col num_round=1
../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost mushroom-col.conf dsplit=col model_in=0001.model
cat dump.nice.$k.txt

View File

@ -1,35 +0,0 @@
# General Parameters, see comment for each definition
# choose the booster, can be gbtree or gblinear
booster = gbtree
# choose logistic regression loss function for binary classification
objective = binary:logistic
# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight(hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3
# Task Parameters
# the number of boosting rounds
num_round = 2
# 0 means do not save any model except the final round model
save_period = 0
# do not create a binary buffer cache for the text input
use_buffer = 0
# The path of training data; %d is the wildcard for the rank of the process
# The idea is that each process takes a feature matrix with a subset of the columns
#
data = "train.col%d"
# The path of validation data, used to monitor the training process; here [test] sets the name of the validation set
eval[test] = "../../demo/data/agaricus.txt.test"
# evaluate on training data as well each round
eval_train = 1
# The path of test data; this needs the full test data, so either avoid it or keep a subsampled version
test:data = "../../demo/data/agaricus.txt.test"

View File

@ -1,32 +0,0 @@
#!/usr/bin/python
import sys
import random
# split a libsvm file into k files by feature (column) id
if len(sys.argv) < 4:
    print('Usage: <fin> <fout_prefix> <k>')
    exit(0)
random.seed(10)
# fmap assigns each feature id to the output file that owns that column
fmap = {}
k = int(sys.argv[3])
fi = open(sys.argv[1], 'r')
fos = []
for i in range(k):
    fos.append(open(sys.argv[2] + '.col%d' % i, 'w'))
for l in fi:
    arr = l.split()
    # every output file gets the label, so the row order stays aligned across slices
    for f in fos:
        f.write(arr[0])
    # route each feature to a random, but fixed, output file
    for it in arr[1:]:
        fid = int(it.split(':')[0])
        if fid not in fmap:
            fmap[fid] = random.randint(0, k - 1)
        fos[fmap[fid]].write(' ' + it)
    for f in fos:
        f.write('\n')
fi.close()
for f in fos:
    f.close()