[DOC] cleanup distributed training
parent df7c7930d0 · commit e7d8ed71d6
@ -1,43 +1,30 @@
XGBoost Change Log
==================

This file records the changes in the xgboost library in reverse chronological order.

## brick: next release candidate
* Major refactor of the core library.
  - Goal: more flexible and modular code as a portable library.
  - Switch to the C++11 standard.
  - Random number generator defaults to ```std::mt19937```.
  - Share the data loading pipeline and logging module from dmlc-core.
  - Enable the registry pattern to allow optional plugins of objective, metric, tree constructor, and data loader.
  - Future plugin modules can be put into xgboost/plugin and registered back to the library.
  - Replace most raw pointers with smart pointers for RAII safety.
* Change the library name to libxgboost.so
* Backward compatibility
  - The binary buffer file is not backward compatible with the previous version.
  - The model file is backward compatible on 64-bit platforms.
* The model file is compatible between 64/32-bit platforms (not yet tested).
* The external memory version and other advanced features will be exposed to the R library as well on Linux.
  - Previously some of these features were blocked due to C++11 and threading limits.
  - The Windows version is still blocked because Rtools does not support ```std::thread```.
* rabit and dmlc-core are maintained through git submodules.
  - Anyone can open a PR to update these dependencies now.

## v0.47 (2016.01.14)
* Changes in the R library
  - fixed a possible problem with Poisson regression.
  - switched from 0 to NA for missing values.
@ -58,23 +45,39 @@ xgboost-0.47
* The Java API is ready for use.
* Added more test cases and continuous integration to make each build more robust.

## v0.4 (2015.05.11)
* Distributed version of xgboost that runs on YARN and scales to billions of examples.
* Direct save/load of data and models from/to S3 and HDFS.
* Feature importance visualization in the R module, by Michael Benesty.
* Predict leaf index.
* Poisson regression for count data.
* Early stopping option in training.
* Native save/load support in R and Python.
  - xgboost models can now be saved using save/load in R.
  - the xgboost Python model is now picklable.
* The sklearn wrapper is supported in the Python module.
* Experimental external memory version.

## v0.3 (2014.09.07)
* Faster tree construction module.
  - Allows subsampling columns during tree construction via ```bst:col_samplebytree=ratio```.
* Support for boosting from initial predictions.
* Experimental version of LambdaRank.
* The linear booster is now parallelized, using parallel coordinate descent.
* Add a [Code Guide](src/README.md) for customizing the objective function and evaluation.
* Add the R module.

## v0.2x (2014.05.20)
* Python module.
* Weighted sample instances.
* Initial version of pairwise rank.

## v0.1 (2014.03.26)
* Initial release.
README.md
@ -15,23 +15,14 @@ XGBoost is part of [DMLC](http://dmlc.github.io/) projects.

Contents
--------
* [Documentation and Tutorials](https://xgboost.readthedocs.org)
* [Code Examples](demo)
* [Build Instruction](doc/build.md)
* [Committers and Contributors](CONTRIBUTORS.md)

What's New
----------
* [XGBoost brick](NEWS.md) Release

Features
--------
@ -45,17 +36,16 @@ Features

Bug Reporting
-------------
* For reporting bugs please use the [xgboost/issues](https://github.com/dmlc/xgboost/issues) page.
* For generic questions or to share your experience using xgboost, please use the [XGBoost User Group](https://groups.google.com/forum/#!forum/xgboost-user/).

Contributing to XGBoost
-----------------------
XGBoost has been developed and used by a group of active community members. Everyone is more than welcome to contribute. It is a way to make the project better and more accessible to more users.
* Check out the [Feature Wish List](https://github.com/dmlc/xgboost/labels/Wish-List) to see what can be improved, or open an issue if you want something.
* Contribute to the [documents and examples](https://github.com/dmlc/xgboost/blob/master/doc/) to share your experience with other users.
* Please add your name to [CONTRIBUTORS.md](CONTRIBUTORS.md) after your patch has been merged.
  - Please also update [NEWS.md](NEWS.md) on changes and improvements in the API and docs.

License
-------
@ -44,8 +44,15 @@ However, the parameter settings can be applied to all versions
* [Multiclass classification](multiclass_classification)
* [Regression](regression)
* [Learning to Rank](rank)
* [Distributed Training](distributed-training)

Benchmarks
----------
* [Starter script for Kaggle Higgs Boson](kaggle-higgs)
* [Kaggle Tradeshift winning solution by daxiongshu](https://github.com/daxiongshu/kaggle-tradeshift-winning-solution)

Machine Learning Challenge Winning Solutions
--------------------------------------------
* XGBoost helps Vlad Mironov, Alexander Guschin to win the [CERN LHCb experiment Flavour of Physics competition](https://www.kaggle.com/c/flavours-of-physics). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/11/30/flavour-of-physics-technical-write-up-1st-place-go-polar-bears/).
* XGBoost helps Mario Filho, Josef Feigl, Lucas, Gilberto to win the [Caterpillar Tube Pricing competition](https://www.kaggle.com/c/caterpillar-tube-pricing). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/09/22/caterpillar-winners-interview-1st-place-gilberto-josef-leustagos-mario/).
* XGBoost helps Halla Yang to win the [Recruit Coupon Purchase Prediction Challenge](https://www.kaggle.com/c/coupon-purchase-prediction). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/10/21/recruit-coupon-purchase-winners-interview-2nd-place-halla-yang/).
demo/distributed-training/README.md (new file, 52 lines)

Distributed XGBoost Training
============================
This is a tutorial on distributed XGBoost training.
Currently xgboost supports distributed training via its CLI program driven by a configuration file.
There are also plans to add distributed support to the Python and other language bindings; please open an issue
if you are interested in contributing.

Build XGBoost with Distributed Filesystem Support
-------------------------------------------------
To use distributed xgboost, you only need to turn on the options to build
with distributed filesystems (HDFS or S3) in ```xgboost/make/config.mk```.
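A minimal sketch of enabling those options and rebuilding; the option names ```USE_HDFS``` and ```USE_S3``` and the exact build flow are assumptions here, so check the comments in your own copy of ```make/config.mk``` before running it.

```bash
# Assumed option names -- verify them against the comments in make/config.mk.
cd xgboost
sed -i 's/^USE_HDFS = 0/USE_HDFS = 1/' make/config.mk
sed -i 's/^USE_S3 = 0/USE_S3 = 1/'     make/config.mk
make -j4
```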
How to Use
----------
* Input data format: LIBSVM format. The example here uses the generated data in the ../data folder.
* Put the data into a distributed filesystem (S3 or HDFS).
* Use the tracker script in dmlc-core/tracker to submit the jobs.
* Like all other DMLC tools, xgboost supports taking a path to a folder as an input argument.
  - All the files in the folder will be used as input.
* Quick start in Hadoop YARN: run ```bash run_yarn.sh <n_hadoop_workers> <n_thread_per_worker> <path_in_HDFS>```, for example as shown below.
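A hypothetical quick-start invocation; the worker count, threads per worker, and HDFS working directory are placeholders for whatever fits your cluster.

```bash
# 4 YARN workers, 8 threads each; data and the final model are staged under the given HDFS path.
bash run_yarn.sh 4 8 /user/$(whoami)/xgboost-demo
```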
Example
-------
* [run_yarn.sh](run_yarn.sh) shows how to submit a job to Hadoop via YARN.
Single Machine vs Distributed Version
-------------------------------------
If you have used xgboost (the single machine version) before, this section shows you how to run xgboost on Hadoop with a slight modification of the conf file.
* IO: instead of reading and writing files locally, we now use HDFS; put the ```hdfs://``` prefix on the address of any file you would like to access.
* File cache: ```dmlc_yarn.py``` also provides several ways to cache necessary files, including the binary file (xgboost) and the conf file.
  - ```dmlc_yarn.py``` will automatically cache files given on the command line. For example, ```dmlc_yarn.py -n 3 $localPath/xgboost.dmlc mushroom.hadoop.conf``` will cache "xgboost.dmlc" and "mushroom.hadoop.conf".
  - You can also use "-f" to manually cache one or more files, like ```-f file1 -f file2```.
  - The local path of cached files in the command is "./".
* More details of submission can be found in the usage of ```dmlc_yarn.py```; a sketch of a manual submission follows this list.
* The model saved by the Hadoop version is compatible with the single machine version.
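A hedged sketch of such a manual submission, based on the command used in run_yarn.sh; the worker count, vcores, the ```HDFS_PATH``` variable, and the choice of shipping the feature map with ```-f``` are placeholders, not part of the original demo.

```bash
HDFS_PATH=/user/$(whoami)/xgboost-demo   # placeholder HDFS working directory

# dmlc_yarn.py automatically caches the xgboost binary and the conf file named on
# the command line; -f ships an extra file (here a feature map) to every worker.
../../dmlc-core/tracker/dmlc_yarn.py -n 3 --vcores 4 \
    -f ../data/featmap.txt \
    ../../xgboost mushroom.hadoop.conf \
    data=hdfs://$HDFS_PATH/data/agaricus.txt.train \
    model_out=hdfs://$HDFS_PATH/mushroom.final.model
```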
Notes
-----
* The code is optimized with multi-threading, so you will want to run xgboost with more vcores for best performance.
  - You will want to set <n_thread_per_worker> to the number of cores you have on each machine.
External Memory Version
-----------------------
XGBoost supports external memory; this makes each process cache data onto local disk during computation, instead of taking up all the memory for storing the data.
See [external memory](https://github.com/dmlc/xgboost/tree/master/doc/external_memory.md) for the syntax of using external memory.

You only need to add a cache prefix to the input file to enable external memory mode. For example, set the training data as
```
data=hdfs:///path-to-my-data/#dtrain.cache
```
This makes xgboost more memory efficient and allows you to run xgboost on larger-scale datasets.
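The same cache suffix can also be appended when the training data path is passed on the command line, as in the submission command of run_yarn.sh shown next; this variant is hypothetical, with ```HDFS_PATH``` standing in for your HDFS working directory.

```bash
HDFS_PATH=/user/$(whoami)/xgboost-demo   # placeholder
# "#dtrain.cache" after the data path enables external memory mode on each worker.
../../dmlc-core/tracker/dmlc_yarn.py -n 4 --vcores 8 ../../xgboost mushroom.hadoop.conf nthread=8 \
    data=hdfs://$HDFS_PATH/data/agaricus.txt.train#dtrain.cache \
    model_out=hdfs://$HDFS_PATH/mushroom.final.model
```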
demo/distributed-training/run_yarn.sh (new executable file, 33 lines)

#!/bin/bash
if [ "$#" -lt 3 ];
then
    echo "Usage: <nworkers> <nthreads> <path_in_HDFS>"
    exit -1
fi

# put the local training files into HDFS
hadoop fs -mkdir $3/data
hadoop fs -put ../data/agaricus.txt.train $3/data
hadoop fs -put ../data/agaricus.txt.test $3/data

# run through rabit on YARN, passing HDFS addresses for the data and the output model
../../dmlc-core/tracker/dmlc_yarn.py -n $1 --vcores $2 ../../xgboost mushroom.hadoop.conf nthread=$2 \
    data=hdfs://$3/data/agaricus.txt.train \
    eval[test]=hdfs://$3/data/agaricus.txt.test \
    model_out=hdfs://$3/mushroom.final.model

# get the final model file
hadoop fs -get $3/mushroom.final.model final.model

# the remaining steps run locally; dmlc-core/yarn/run_hdfs_prog.py sets up the appropriate environment
# (the commented lines show the equivalent direct invocation of the xgboost.dmlc binary)

# output predictions (task=pred)
#../../xgboost.dmlc mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test
../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test

# print the boosters of final.model into dump.raw.txt
#../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt

# use the feature map when dumping for better readability
#../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
cat dump.nice.txt
(deleted file, 28 lines)

Distributed XGBoost
===================
Distributed XGBoost is now part of [Wormhole](https://github.com/dmlc/wormhole).
Check out this [link](https://github.com/dmlc/wormhole/tree/master/learn/xgboost) for usage examples, builds, and job submissions.
* The distributed version is built on Rabit: the [Reliable Allreduce and Broadcast Library](https://github.com/dmlc/rabit).
  - Rabit is a portable library that provides fault tolerance for Allreduce calls in distributed machine learning.
  - This makes xgboost portable and fault-tolerant against node failures.

Notes
=====
* Rabit handles all the fault tolerance and communication efficiently; we only use platform-specific commands to start the programs.
  - The Hadoop version does not rely on MapReduce to do iterations.
  - You can expect xgboost not to suffer the drawbacks of iterative MapReduce programs.
* The design choice was made because Allreduce is very natural and efficient for distributed tree building.
  - In the current version of xgboost, the distributed version only adds several lines of Allreduce synchronization code.
* The multi-threading nature of xgboost is inherited in distributed mode.
  - This means xgboost efficiently uses all the threads on one machine, and communicates only between machines.
  - Remember to run one xgboost process per machine; this will give you the maximum speedup.
* For more information about rabit and how it works, see the [Rabit tutorial](https://github.com/dmlc/rabit/tree/master/guide).

Solvers
=======
* The column-based solver splits data by column; each node works on a subset of columns,
  and it uses exactly the same algorithm as the single node version.
* The row-based solver splits data by row; each node works on a subset of rows.
  It uses an approximate histogram count algorithm, and will only examine a subset of
  potential split points as opposed to all split points.
  - This is the mode used by the current Hadoop version, since data is usually stored by rows in many industry systems.
(deleted file, 19 lines)

Distributed XGBoost: Column Split Version
=========================================
* Run ```bash mushroom-col-rabit.sh <n-process>```
  - mushroom-col-rabit.sh starts an xgboost job using rabit's allreduce.
* Run ```bash mushroom-col-rabit-mock.sh <n-process>```
  - mushroom-col-rabit-mock.sh starts an xgboost job using rabit's allreduce, inserts a suicide signal at a certain point, and tests recovery.

How to Use
==========
* First split the data by column.
* In the config, specify the data file as containing a wildcard %d, where %d is the rank of the node; each node will load its own part of the data.
* Enable column split mode by ```dsplit=col```.

Notes
=====
* The code is multi-threaded, so you want to run one process per node.
* The code will work correctly as long as the union of the column subsets covers all the columns we are interested in.
  - The column subsets can overlap with each other.
* It uses exactly the same algorithm as the single node version, examining all potential split points.
(deleted file, 25 lines)

#!/bin/bash
if [[ $# -ne 1 ]]
then
    echo "Usage: nprocess"
    exit -1
fi

#
# This script is the same as mushroom-col except that we use xgboost instead of xgboost-mpi.
# xgboost uses the built-in TCP-based allreduce module, and can be run in more environments, so long as we know how to start the job by modifying ../submit_job_tcp.py.
#
rm -rf train.col* *.model
k=$1

# split the libsvm file into k subfiles
python splitsvm.py ../../demo/data/agaricus.txt.train train $k

# run xgboost through rabit's demo tracker, injecting mock failures to test recovery
../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost.mock mushroom-col.conf dsplit=col mock=0,2,0,0 mock=1,2,0,0 mock=2,2,8,0 mock=2,3,0,0

# the model can be directly loaded by the single machine xgboost solver, as usual
#../../xgboost mushroom-col.conf task=dump model_in=0002.model fmap=../../demo/data/featmap.txt name_dump=dump.nice.$k.txt

#cat dump.nice.$k.txt
(deleted file, 28 lines)

#!/bin/bash
if [[ $# -ne 1 ]]
then
    echo "Usage: nprocess"
    exit -1
fi

#
# This script is the same as mushroom-col except that we use xgboost instead of xgboost-mpi.
# xgboost uses the built-in TCP-based allreduce module, and can be run in more environments, so long as we know how to start the job by modifying ../submit_job_tcp.py.
#
rm -rf train.col* *.model
k=$1

# split the libsvm file into k subfiles
python splitsvm.py ../../demo/data/agaricus.txt.train train $k

# run xgboost through rabit's demo tracker
../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost mushroom-col.conf dsplit=col

# the model can be directly loaded by the single machine xgboost solver, as usual
../../xgboost mushroom-col.conf task=dump model_in=0002.model fmap=../../demo/data/featmap.txt name_dump=dump.nice.$k.txt

# run for one round, and continue training
../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost mushroom-col.conf dsplit=col num_round=1
../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost mushroom-col.conf dsplit=col model_in=0001.model

cat dump.nice.$k.txt
(deleted file, 35 lines)

# General Parameters, see comment for each definition
# choose the booster, can be gbtree or gblinear
booster = gbtree
# choose logistic regression loss function for binary classification
objective = binary:logistic

# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight (hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3

# Task Parameters
# the number of rounds of boosting
num_round = 2
# 0 means do not save any model except the final round model
save_period = 0
use_buffer = 0

# The path of training data; %d is the wildcard for the rank of the process
# The idea is that each process takes a feature matrix with a subset of columns
data = "train.col%d"

# The path of validation data, used to monitor training progress; here [test] sets the name of the validation set
eval[test] = "../../demo/data/agaricus.txt.test"
# evaluate on training data as well each round
eval_train = 1

# The path of test data; this needs the full test data, so try not to use it, or keep a subsampled version
test:data = "../../demo/data/agaricus.txt.test"
(deleted file, 32 lines)

#!/usr/bin/python
import sys
import random

# split a libsvm file into k column subsets, one output file per subset
if len(sys.argv) < 4:
    print('Usage: <fin> <fout_prefix> <k>')
    exit(0)

random.seed(10)
# maps a feature id to the column subset (output file) it is assigned to
fmap = {}

k = int(sys.argv[3])
fi = open(sys.argv[1], 'r')
fos = [open(sys.argv[2] + '.col%d' % i, 'w') for i in range(k)]

for l in fi:
    arr = l.split()
    # every output file gets the label, so each column subset is a valid libsvm file
    for f in fos:
        f.write(arr[0])
    # each feature id is randomly (but consistently) assigned to one column subset
    for it in arr[1:]:
        fid = int(it.split(':')[0])
        if fid not in fmap:
            fmap[fid] = random.randint(0, k - 1)
        fos[fmap[fid]].write(' ' + it)
    for f in fos:
        f.write('\n')

fi.close()
for f in fos:
    f.close()