diff --git a/CHANGES.md b/NEWS.md
similarity index 91%
rename from CHANGES.md
rename to NEWS.md
index 441ab8461..e9c89da00 100644
--- a/CHANGES.md
+++ b/NEWS.md
@@ -1,43 +1,30 @@
-Change Log
-==========
+XGBoost Change Log
+==================
-xgboost-0.1
------------
-* Initial release
+This file records the changes in the xgboost library in reverse chronological order.
-xgboost-0.2x
-------------
-* Python module
-* Weighted samples instances
-* Initial version of pairwise rank
+## brick: next release candidate
+* Major refactor of core library.
+ - Goal: more flexible and modular code as a portable library.
+ - Switch to the C++11 standard.
+ - Random number generator defaults to ```std::mt19937```.
+ - Share the data loading pipeline and logging module from dmlc-core.
+ - Enable a registry pattern to allow optional plugins for objectives, metrics, tree constructors and data loaders.
+ - Future plugin modules can be put into xgboost/plugin and registered back into the library.
+ - Replace most raw pointers with smart pointers for RAII safety.
+* Change library name to libxgboost.so
+* Backward compatibility
+ - The binary buffer file is not backward compatible with previous versions.
+ - The model file is backward compatible on 64 bit platforms.
+* The model file is compatible between 64/32 bit platforms (not yet tested).
+* External memory version and other advanced features will be exposed to the R library as well on Linux.
+ - Previously some of these features were blocked due to C++11 and threading limits.
+ - The Windows version is still blocked because Rtools does not support ```std::thread```.
+* rabit and dmlc-core are maintained through git submodules
+ - Anyone can open a PR to update these dependencies now.
-xgboost-0.3
------------
-* Faster tree construction module
- - Allows subsample columns during tree construction via ```bst:col_samplebytree=ratio```
-* Support for boosting from initial predictions
-* Experimental version of LambdaRank
-* Linear booster is now parallelized, using parallel coordinated descent.
-* Add [Code Guide](src/README.md) for customizing objective function and evaluation
-* Add R module
+## v0.47 (2016.01.14)
-xgboost-0.4
------------
-* Distributed version of xgboost that runs on YARN, scales to billions of examples
-* Direct save/load data and model from/to S3 and HDFS
-* Feature importance visualization in R module, by Michael Benesty
-* Predict leaf index
-* Poisson regression for counts data
-* Early stopping option in training
-* Native save load support in R and python
- - xgboost models now can be saved using save/load in R
- - xgboost python model is now pickable
-* sklearn wrapper is supported in python module
-* Experimental External memory version
-
-
-xgboost-0.47
-------------
 * Changes in R library
  - fixed possible problem of poisson regression.
  - switched from 0 to NA for missing values.
@@ -58,23 +45,39 @@ xgboost-0.47
 * Java api is ready for use
 * Added more test cases and continuous integration to make each build more robust.
-xgboost brick: next release candidate
--------------------------------------
-* Major refactor of core library.
- - Goal: more flexible and modular code as a portable library.
- - Switch to use of c++11 standard code.
- - Random number generator defaults to ```std::mt19937```.
- - Share the data loading pipeline and logging module from dmlc-core.
- - Enable registry pattern to allow optionally plugin of objective, metric, tree constructor, data loader.
- - Future plugin modules can be put into xgboost/plugin and register back to the library.
- - Remove most of the raw pointers to smart ptrs, for RAII safety.
-* Change library name to libxgboost.so
-* Backward compatiblity
- - The binary buffer file is not backward compatible with previous version.
- - The model file is backward compatible on 64 bit platforms.
-* The model file is compatible between 64/32 bit platforms(not yet tested).
-* External memory version and other advanced features will be exposed to R library as well on linux.
- - Previously some of the features are blocked due to C++11 and threading limits.
- - The windows version is still blocked due to Rtools do not support ```std::thread```.
-* rabit and dmlc-core are maintained through git submodule
- - Anyone can open PR to update these dependencies now.
+## v0.4 (2015.05.11)
+
+* Distributed version of xgboost that runs on YARN, scales to billions of examples
+* Direct save/load data and model from/to S3 and HDFS
+* Feature importance visualization in R module, by Michael Benesty
+* Predict leaf index
+* Poisson regression for counts data
+* Early stopping option in training
+* Native save load support in R and python
+ - xgboost models now can be saved using save/load in R
+ - xgboost python model is now pickable
+* sklearn wrapper is supported in python module
+* Experimental External memory version
+
+
+## v0.3 (2014.09.07)
+
+* Faster tree construction module
+ - Allows subsample columns during tree construction via ```bst:col_samplebytree=ratio```
+* Support for boosting from initial predictions
+* Experimental version of LambdaRank
+* Linear booster is now parallelized, using parallel coordinated descent.
+* Add [Code Guide](src/README.md) for customizing objective function and evaluation
+* Add R module
+
+
+## v0.2x (2014.05.20)
+
+* Python module
+* Weighted samples instances
+* Initial version of pairwise rank
+
+
+## v0.1 (2014.03.26)
+
+* Initial release
\ No newline at end of file
diff --git a/README.md b/README.md
index 0586abcae..46a9f8537 100644
--- a/README.md
+++ b/README.md
@@ -15,23 +15,14 @@ XGBoost is part of [DMLC](http://dmlc.github.io/) projects.
 Contents
 --------
-* [Documentation](https://xgboost.readthedocs.org)
-* [Usecases](doc/index.md#highlight-links)
+* [Documentation and Tutorials](https://xgboost.readthedocs.org)
 * [Code Examples](demo)
 * [Build Instruction](doc/build.md)
 * [Committers and Contributors](CONTRIBUTORS.md)
 
 What's New
 ----------
-* XGBoost [brick](CHANGES.md)
-* XGBoost helps Vlad Mironov, Alexander Guschin to win the [CERN LHCb experiment Flavour of Physics competition](https://www.kaggle.com/c/flavours-of-physics). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/11/30/flavour-of-physics-technical-write-up-1st-place-go-polar-bears/).
-* XGBoost helps Mario Filho, Josef Feigl, Lucas, Gilberto to win the [Caterpillar Tube Pricing competition](https://www.kaggle.com/c/caterpillar-tube-pricing). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/09/22/caterpillar-winners-interview-1st-place-gilberto-josef-leustagos-mario/).
-* XGBoost helps Halla Yang to win the [Recruit Coupon Purchase Prediction Challenge](https://www.kaggle.com/c/coupon-purchase-prediction). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/10/21/recruit-coupon-purchase-winners-interview-2nd-place-halla-yang/).
-
-Version
--------
-* Current version xgboost-0.6 (brick)
- - See [Change log](CHANGES.md) for details
+* [XGBoost brick](NEWS.md) Release
 
 Features
 --------
 
 Bug Reporting
 -------------
-
 * For reporting bugs please use the [xgboost/issues](https://github.com/dmlc/xgboost/issues) page.
 * For generic questions or to share your experience using xgboost please use the [XGBoost User Group](https://groups.google.com/forum/#!forum/xgboost-user/)
-
 Contributing to XGBoost
 -----------------------
 XGBoost has been developed and used by a group of active community members. Everyone is more than welcome to contribute. It is a way to make the project better and more accessible to more users.
 * Check out [Feature Wish List](https://github.com/dmlc/xgboost/labels/Wish-List) to see what can be improved, or open an issue if you want something.
 * Contribute to the [documents and examples](https://github.com/dmlc/xgboost/blob/master/doc/) to share your experience with other users.
-* Please add your name to [CONTRIBUTORS.md](CONTRIBUTORS.md) after your patch has been merged.
+* Please add your name to [CONTRIBUTORS.md](CONTRIBUTORS.md) after your patch has been merged.
+ - Please also update [NEWS.md](NEWS.md) to record changes and improvements to the API and docs.
 
 License
 -------
diff --git a/demo/README.md b/demo/README.md
index 5a7a25f76..229ffc6ff 100644
--- a/demo/README.md
+++ b/demo/README.md
@@ -44,8 +44,15 @@ However, the parameter settings can be applied to all versions
 * [Multiclass classification](multiclass_classification)
 * [Regression](regression)
 * [Learning to Rank](rank)
+* [Distributed Training](distributed-training)
 
 Benchmarks
 ----------
 * [Starter script for Kaggle Higgs Boson](kaggle-higgs)
 * [Kaggle Tradeshift winning solution by daxiongshu](https://github.com/daxiongshu/kaggle-tradeshift-winning-solution)
+
+Machine Learning Challenge Winning Solutions
+--------------------------------------------
+* XGBoost helped Vlad Mironov and Alexander Guschin win the [CERN LHCb experiment Flavour of Physics competition](https://www.kaggle.com/c/flavours-of-physics). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/11/30/flavour-of-physics-technical-write-up-1st-place-go-polar-bears/).
+* XGBoost helped Mario Filho, Josef Feigl, Lucas, and Gilberto win the [Caterpillar Tube Pricing competition](https://www.kaggle.com/c/caterpillar-tube-pricing). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/09/22/caterpillar-winners-interview-1st-place-gilberto-josef-leustagos-mario/).
+* XGBoost helped Halla Yang win the [Recruit Coupon Purchase Prediction Challenge](https://www.kaggle.com/c/coupon-purchase-prediction). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/10/21/recruit-coupon-purchase-winners-interview-2nd-place-halla-yang/).
diff --git a/demo/distributed-training/README.md b/demo/distributed-training/README.md
new file mode 100644
index 000000000..3926612cc
--- /dev/null
+++ b/demo/distributed-training/README.md
@@ -0,0 +1,52 @@
+Distributed XGBoost Training
+============================
+This is a tutorial on distributed training with XGBoost.
+Currently xgboost supports distributed training via the CLI program with a configuration file.
+There is also a plan to push distributed Python and other language bindings; please open an issue
+if you are interested in contributing.
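+
+As an illustration, a configuration file for a distributed job looks just like a single machine one,
+except that the data paths point to a distributed filesystem. The sketch below is hypothetical (it is
+not the actual ```mushroom.hadoop.conf```; the parameter values are only examples):
+
+```
+# booster and objective, same as in a single machine conf
+booster = gbtree
+objective = binary:logistic
+# tree booster parameters
+eta = 1.0
+max_depth = 3
+# task parameters
+num_round = 2
+save_period = 0
+# data now lives on HDFS instead of the local disk
+data = "hdfs:///path-to-my-data/agaricus.txt.train"
+eval[test] = "hdfs:///path-to-my-data/agaricus.txt.test"
+model_out = "hdfs:///path-to-my-output/mushroom.final.model"
+```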
+
+Build XGBoost with Distributed Filesystem Support
+-------------------------------------------------
+To use distributed xgboost, you only need to turn on the options for building
+with distributed filesystems (HDFS or S3) in ```xgboost/make/config.mk```.
+
+How to Use
+----------
+* Input data format: LIBSVM format. The example here uses generated data in the ../data folder.
+* Put the data into some distributed filesystem (S3 or HDFS).
+* Use the tracker script in dmlc-core/tracker to submit the jobs.
+* Like all other DMLC tools, xgboost supports taking a path to a folder as the input argument.
+ - All the files in the folder will be used as input.
+* Quick start in Hadoop YARN: run ```bash run_yarn.sh ```
+
+Example
+-------
+* [run_yarn.sh](run_yarn.sh) shows how to submit a job to Hadoop via YARN.
+
+Single machine vs Distributed Version
+-------------------------------------
+If you have used xgboost (single machine version) before, this section will show you how to run xgboost on hadoop with a slight modification of the conf file.
+* IO: instead of reading and writing files locally, we now use HDFS; put the ```hdfs://``` prefix on the address of any file you want to access.
+* File cache: ```dmlc_yarn.py``` also provides several ways to cache necessary files, including the binary file (xgboost) and the conf file.
+ - ```dmlc_yarn.py``` will automatically cache files in the command line. For example, ```dmlc_yarn.py -n 3 $localPath/xgboost.dmlc mushroom.hadoop.conf``` will cache "xgboost.dmlc" and "mushroom.hadoop.conf".
+ - You could also use "-f" to manually cache one or more files, like ```-f file1 -f file2```.
+ - The local path of cached files in the command is "./".
+* For more details on job submission, see the usage of ```dmlc_yarn.py```.
+* The model saved by the hadoop version is compatible with the single machine version.
+
+Notes
+-----
+* The code is optimized with multi-threading, so you will want to run xgboost with more vcores for best performance.
+ - You will want to set the number of vcores to the number of cores you have on each machine.
+
+
+External Memory Version
+-----------------------
+XGBoost supports external memory. This makes each process cache data on local disk during computation, instead of holding all the data in memory.
+See [external memory](https://github.com/dmlc/xgboost/tree/master/doc/external_memory.md) for the syntax of using external memory.
+
+You only need to add a cache prefix to the input file to enable external memory mode. For example, set the training data as
+```
+data=hdfs:///path-to-my-data/#dtrain.cache
+```
+This will make xgboost more memory efficient and allows you to run xgboost on larger-scale datasets.
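+
+Putting It Together
+-------------------
+A rough sketch of the workflow described above. The ```USE_HDFS```/```USE_S3``` option names in
+```make/config.mk```, the worker counts, and the paths are assumptions for illustration; adjust them
+to your own setup:
+
+```
+# 1. enable distributed filesystem support in xgboost/make/config.mk, then rebuild
+#      USE_HDFS = 1
+#      USE_S3 = 1
+make -j4
+
+# 2. put the training data on HDFS
+hadoop fs -put ../data/agaricus.txt.train hdfs:///path-to-my-data/
+
+# 3. submit through the YARN tracker; xgboost and the conf file given on the command line
+#    are cached automatically, and extra files can be added with -f
+../../dmlc-core/tracker/dmlc_yarn.py -n 4 --vcores 2 \
+  ../../xgboost mushroom.hadoop.conf nthread=2 \
+  data=hdfs:///path-to-my-data/agaricus.txt.train#dtrain.cache
+```
+
+The trailing ```#dtrain.cache``` suffix is optional; it turns on the external memory mode described
+above, so each worker caches data on its local disk instead of holding everything in memory.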
diff --git a/demo/distributed-training/run_yarn.sh b/demo/distributed-training/run_yarn.sh
new file mode 100755
index 000000000..3d7c6bf05
--- /dev/null
+++ b/demo/distributed-training/run_yarn.sh
@@ -0,0 +1,33 @@
+#!/bin/bash
+if [ "$#" -lt 3 ];
+then
+ echo "Usage: "
+ exit -1
+fi
+
+# put the local training files onto HDFS
+hadoop fs -mkdir $3/data
+hadoop fs -put ../data/agaricus.txt.train $3/data
+hadoop fs -put ../data/agaricus.txt.test $3/data
+
+# run the distributed job through the YARN tracker, passing the data addresses on HDFS
+../../dmlc-core/tracker/dmlc_yarn.py -n $1 --vcores $2 ../../xgboost mushroom.hadoop.conf nthread=$2\
+ data=hdfs://$3/data/agaricus.txt.train\
+ eval[test]=hdfs://$3/data/agaricus.txt.test\
+ model_out=hdfs://$3/mushroom.final.model
+
+# get the final model file
+hadoop fs -get $3/mushroom.final.model final.model
+
+# use dmlc-core/yarn/run_hdfs_prog.py to set up the appropriate env
+
+# output prediction task=pred
+#../../xgboost.dmlc mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test
+../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test
+# print the boosters of final.model in dump.raw.txt
+#../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
+../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
+# use the feature map in printing for better visualization
+#../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
+../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
+cat dump.nice.txt
diff --git a/multi-node/README.md b/multi-node/README.md
deleted file mode 100644
index 593a7d3c8..000000000
--- a/multi-node/README.md
+++ /dev/null
@@ -1,28 +0,0 @@
-Distributed XGBoost
-======
-Distributed XGBoost is now part of [Wormhole](https://github.com/dmlc/wormhole).
-Checkout this [Link](https://github.com/dmlc/wormhole/tree/master/learn/xgboost) for usage examples, build and job submissions.
-* The distributed version is built on Rabit:[Reliable Allreduce and Broadcast Library](https://github.com/dmlc/rabit)
- - Rabit is a portable library that provides fault-tolerance for Allreduce calls for distributed machine learning
- - This makes xgboost portable and fault-tolerant against node failures
-
-Notes
-====
-* Rabit handles all the fault tolerant and communications efficiently, we only use platform specific command to start programs
- - The Hadoop version does not rely on Mapreduce to do iterations
- - You can expect xgboost not suffering the drawbacks of iterative MapReduce program
-* The design choice was made because Allreduce is very natural and efficient for distributed tree building
- - In current version of xgboost, the distributed version is only adds several lines of Allreduce synchronization code
-* The multi-threading nature of xgboost is inheritated in distributed mode
- - This means xgboost efficiently use all the threads in one machine, and communicates only between machines
- - Remember to run on xgboost process per machine and this will give you maximum speedup
-* For more information about rabit and how it works, see the [Rabit's Tutorial](https://github.com/dmlc/rabit/tree/master/guide)
-
-Solvers
-=====
-* Column-based solver split data by column, each node work on subset of columns,
- it uses exactly the same algorithm as single node version.
-* Row-based solver split data by row, each node work on subset of rows,
- it uses an approximate histogram count algorithm, and will only examine subset of
- potential split points as opposed to all split points.
- - This is the mode used by current hadoop version, since usually data was stored by rows in many industry system
diff --git a/multi-node/col-split/README.md b/multi-node/col-split/README.md
deleted file mode 100644
index 3ea0799fe..000000000
--- a/multi-node/col-split/README.md
+++ /dev/null
@@ -1,19 +0,0 @@
-Distributed XGBoost: Column Split Version
-====
-* run ```bash mushroom-col-rabit.sh ```
- - mushroom-col-rabit.sh starts xgboost job using rabit's allreduce
-* run ```bash mushroom-col-rabit-mock.sh ```
- - mushroom-col-rabit-mock.sh starts xgboost job using rabit's allreduce, inserts suicide signal at certain point and test recovery
-
-How to Use
-====
-* First split the data by column,
-* In the config, specify data file as containing a wildcard %d, where %d is the rank of the node, each node will load their part of data
-* Enable column split mode by ```dsplit=col```
-
-Notes
-====
-* The code is multi-threaded, so you want to run one process per node
-* The code will work correctly as long as union of each column subset is all the columns we are interested in.
- - The column subset can overlap with each other.
-* It uses exactly the same algorithm as single node version, to examine all potential split points.
diff --git a/multi-node/col-split/mushroom-col-rabit-mock.sh b/multi-node/col-split/mushroom-col-rabit-mock.sh
deleted file mode 100755
index b4208f04c..000000000
--- a/multi-node/col-split/mushroom-col-rabit-mock.sh
+++ /dev/null
@@ -1,25 +0,0 @@
-#!/bin/bash
-if [[ $# -ne 1 ]]
-then
- echo "Usage: nprocess"
- exit -1
-fi
-
-#
-# This script is same as mushroom-col except that we will be using xgboost instead of xgboost-mpi
-# xgboost used built in tcp-based allreduce module, and can be run on more enviroment, so long as we know how to start job by modifying ../submit_job_tcp.py
-#
-rm -rf train.col* *.model
-k=$1
-
-# split the lib svm file into k subfiles
-python splitsvm.py ../../demo/data/agaricus.txt.train train $k
-
-# run xgboost mpi
-../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost.mock mushroom-col.conf dsplit=col mock=0,2,0,0 mock=1,2,0,0 mock=2,2,8,0 mock=2,3,0,0
-
-# the model can be directly loaded by single machine xgboost solver, as usuall
-#../../xgboost mushroom-col.conf task=dump model_in=0002.model fmap=../../demo/data/featmap.txt name_dump=dump.nice.$k.txt
-
-
-#cat dump.nice.$k.txt
diff --git a/multi-node/col-split/mushroom-col-rabit.sh b/multi-node/col-split/mushroom-col-rabit.sh
deleted file mode 100755
index 77e0c904c..000000000
--- a/multi-node/col-split/mushroom-col-rabit.sh
+++ /dev/null
@@ -1,28 +0,0 @@
-#!/bin/bash
-if [[ $# -ne 1 ]]
-then
- echo "Usage: nprocess"
- exit -1
-fi
-
-#
-# This script is same as mushroom-col except that we will be using xgboost instead of xgboost-mpi
-# xgboost used built in tcp-based allreduce module, and can be run on more enviroment, so long as we know how to start job by modifying ../submit_job_tcp.py
-#
-rm -rf train.col* *.model
-k=$1
-
-# split the lib svm file into k subfiles
-python splitsvm.py ../../demo/data/agaricus.txt.train train $k
-
-# run xgboost mpi
-../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost mushroom-col.conf dsplit=col
-
-# the model can be directly loaded by single machine xgboost solver, as usuall
-../../xgboost mushroom-col.conf task=dump model_in=0002.model fmap=../../demo/data/featmap.txt name_dump=dump.nice.$k.txt
-
-# run for one round, and continue training
-../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost mushroom-col.conf dsplit=col num_round=1
-../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost mushroom-col.conf mushroom-col.conf dsplit=col model_in=0001.model
-
-cat dump.nice.$k.txt
diff --git a/multi-node/col-split/mushroom-col.conf b/multi-node/col-split/mushroom-col.conf
deleted file mode 100644
index 2c779a44d..000000000
--- a/multi-node/col-split/mushroom-col.conf
+++ /dev/null
@@ -1,35 +0,0 @@
-# General Parameters, see comment for each definition
-# choose the booster, can be gbtree or gblinear
-booster = gbtree
-# choose logistic regression loss function for binary classification
-objective = binary:logistic
-
-# Tree Booster Parameters
-# step size shrinkage
-eta = 1.0
-# minimum loss reduction required to make a further partition
-gamma = 1.0
-# minimum sum of instance weight(hessian) needed in a child
-min_child_weight = 1
-# maximum depth of a tree
-max_depth = 3
-
-# Task Parameters
-# the number of round to do boosting
-num_round = 2
-# 0 means do not save any model except the final round model
-save_period = 0
-use_buffer = 0
-
-# The path of training data %d is the wildcard for the rank of the data
-# The idea is each process take a feature matrix with subset of columns
-#
-data = "train.col%d"
-
-# The path of validation data, used to monitor training process, here [test] sets name of the validation set
-eval[test] = "../../demo/data/agaricus.txt.test"
-# evaluate on training data as well each round
-eval_train = 1
-
-# The path of test data, need to use full data of test, try not use it, or keep an subsampled version
-test:data = "../../demo/data/agaricus.txt.test"
diff --git a/multi-node/col-split/splitsvm.py b/multi-node/col-split/splitsvm.py
deleted file mode 100644
index 365aef610..000000000
--- a/multi-node/col-split/splitsvm.py
+++ /dev/null
@@ -1,32 +0,0 @@
-#!/usr/bin/python
-import sys
-import random
-
-# split libsvm file into different subcolumns
-if len(sys.argv) < 4:
-    print ('Usage: k')
-    exit(0)
-
-random.seed(10)
-fmap = {}
-
-k = int(sys.argv[3])
-fi = open( sys.argv[1], 'r' )
-fos = []
-
-for i in range(k):
-    fos.append(open( sys.argv[2]+'.col%d' % i, 'w' ))
-
-for l in open(sys.argv[1]):
-    arr = l.split()
-    for f in fos:
-        f.write(arr[0])
-    for it in arr[1:]:
-        fid = int(it.split(':')[0])
-        if fid not in fmap:
-            fmap[fid] = random.randint(0, k-1)
-        fos[fmap[fid]].write(' '+it)
-    for f in fos:
-        f.write('\n')
-for f in fos:
-    f.close()