Merge remote branch 'src/master' into custom_loss_cv_fix
commit f05c7d87cb
README.md
@@ -22,8 +22,9 @@ Highlights of Usecases: [Highlight Links](doc/README.md#highlight-links)

What's New
==========
* XGBoost wins [WWW2015 Microsoft Malware Classification Challenge (BIG 2015)](http://www.kaggle.com/c/malware-classification/forums/t/13490/say-no-to-overfitting-approaches-sharing)
  - Check out the winning solution at [Highlight links](doc/README.md#highlight-links)
* [External Memory Version](doc/external_memory.md) (see the sketch after this list)
* XGBoost now supports HDFS and S3
* [Distributed XGBoost now runs on YARN](https://github.com/dmlc/wormhole/tree/master/learn/xgboost)
* [xgboost user group](https://groups.google.com/forum/#!forum/xgboost-user/) for tracking changes and sharing your experience with xgboost
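As a quick illustration of the external-memory feature (the file names here are placeholders), appending a cache-file suffix to the data path switches xgboost into external-memory mode:

```python
import xgboost as xgb

# The '#' suffix names an on-disk cache file; xgboost then streams the
# data from disk instead of holding it all in memory.
dtrain = xgb.DMatrix("train.libsvm#dtrain.cache")
bst = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=10)
```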
@@ -48,50 +49,10 @@ Features
- It inherits all the optimizations made in single machine mode, maximally utilizing the resources with both multi-threading and distributed computing.

Build
=====
* Run ```bash build.sh``` (you can also type make)
* If you have a C++11 compiler, it is recommended to type ```make cxx11=1```
  - C++11 is not used by default
* If your compiler does not come with OpenMP support, it will emit a warning telling you that the code will compile in single-thread mode, and you will get a single-thread xgboost
* You may get an error: -lgomp is not found
  - You can type ```make no_omp=1```, which will get you a single-thread xgboost
  - Alternatively, you can upgrade your compiler to compile the multi-thread version
* Windows (VS 2010): see the [windows](windows) folder
  - In principle, you add all the cpp files listed in the Makefile to the project, and build
* OS X:
  - For users who want OpenMP support using [Homebrew](http://brew.sh/), run ```brew update``` (this ensures that you install gcc-4.9 or above) and ```brew install gcc --without-multilib```. Once it is installed, edit the [Makefile](Makefile) by replacing:
```
export CC = gcc
export CXX = g++
```
with
```
export CC = gcc-4.9
export CXX = g++-4.9
```
Then run ```bash build.sh``` normally.

  - For users who want to use [High Performance Computing for Mac OS X](http://hpc.sourceforge.net/), download the GCC 4.9 binary tarball and follow the installation guide to install it under `/usr/local`. Then edit the [Makefile](Makefile) by replacing:
```
export CC = gcc
export CXX = g++
```
with
```
export CC = /usr/local/bin/gcc
export CXX = /usr/local/bin/g++
```
Then run ```bash build.sh``` normally. This solution was provided by [Phil Culliton](https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/12947/achieve-0-50776-on-the-leaderboard-in-a-minute-with-xgboost/68308#post68308).

Build with HDFS and S3 Support
=====
* To build xgboost with HDFS/S3 support and distributed learning, it is recommended to build with dmlc, with the following steps
  - ```git clone https://github.com/dmlc/dmlc-core```
  - Follow the instructions in dmlc-core/make/config.mk to compile libdmlc.a
  - In the root folder of xgboost, type ```make dmlc=dmlc-core```
* This will allow xgboost to directly load data from and save models to HDFS and S3
  - Simply prefix the filename with s3:// or hdfs://
* This xgboost can then be used for distributed learning
  - Normally it gives what you want
  - See [Build Instruction](doc/build.md) for more information

Version
=======
build.sh
@@ -4,7 +4,8 @@
# This will automatically make xgboost for Mac users who don't have OpenMP support.
# In most cases, typing make will give what you want.

# download rabit
# See additional instructions in doc/build.md

if make; then
    echo "Successfully built multi-thread xgboost"
@@ -15,4 +16,6 @@ else
    make clean
    make no_omp=1
    echo "Successfully built single-thread xgboost"
    echo "If you want the multi-threaded version"
    echo "See additional instructions in doc/build.md"
fi
doc/README.md
@@ -5,6 +5,8 @@ List of Documentations
* [Learning to use xgboost by example](../demo)
* [External Memory Version](external_memory.md)
* [Text input format](input_format.md)
* [Build Instruction](build.md)
* [Notes on Parameter Tuning](param_tuning.md)
* [Notes on the Code](../src)
* List of all parameters and their usage: [Parameters](parameter.md)
* Learning about the model: [Introduction to Boosted Trees](http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf)
@@ -18,10 +20,11 @@ How to get started
Highlight Links
====
This section is about blog posts, presentations and videos discussing how to use xgboost to solve your interesting problem. If you think something belongs here, send a pull request.
* [Kaggle Malware Prediction winning solution](https://github.com/xiaozhouwang/kaggle_Microsoft_Malware)
* [Kaggle Tradeshift winning solution by daxiongshu](https://github.com/daxiongshu/kaggle-tradeshift-winning-solution)
* [Feature Importance Analysis with XGBoost in Tax audit](http://fr.slideshare.net/MichaelBENESTY/feature-importance-analysis-with-xgboost-in-tax-audit)
* Video tutorial: [Better Optimization with Repeated Cross Validation and the XGBoost model](https://www.youtube.com/watch?v=Og7CGAfSr_Y)
* [Winning solution of Kaggle Higgs competition: what a single model can do](http://no2147483647.wordpress.com/2014/09/17/winning-solution-of-kaggle-higgs-competition-what-a-single-model-can-do/)

Contribution
====
doc/build.md (new file)
@@ -0,0 +1,48 @@
Build XGBoost
====
* Run ```bash build.sh``` (you can also type make)
* If you have a C++11 compiler, it is recommended to type ```make cxx11=1```
  - C++11 is not used by default
* If your compiler does not come with OpenMP support, it will emit a warning telling you that the code will compile in single-thread mode, and you will get a single-thread xgboost
* You may get an error: -lgomp is not found
  - You can type ```make no_omp=1```, which will get you a single-thread xgboost
  - Alternatively, you can upgrade your compiler to compile the multi-thread version
* Windows (VS 2010): see the [../windows](../windows) folder
  - In principle, you add all the cpp files listed in the Makefile to the project, and build
* OS X with multi-threading support: see [next section](#openmp-for-os-x)

OpenMP for OS X
====
* For users who want OpenMP support using [Homebrew](http://brew.sh/), run ```brew update``` (this ensures that you install gcc-4.9 or above) and ```brew install gcc --without-multilib```. Once it is installed, edit [../Makefile](../Makefile) by replacing:
```bash
export CC = gcc
export CXX = g++
```
with
```bash
export CC = gcc-4.9
export CXX = g++-4.9
```
Then run ```bash build.sh``` normally.

* For users who want to use [High Performance Computing for Mac OS X](http://hpc.sourceforge.net/), download the GCC 4.9 binary tarball and follow the installation guide to install it under `/usr/local`. Then edit [../Makefile](../Makefile) by replacing:
```bash
export CC = gcc
export CXX = g++
```
with
```bash
export CC = /usr/local/bin/gcc
export CXX = /usr/local/bin/g++
```
Then run ```bash build.sh``` normally. This solution was provided by [Phil Culliton](https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/12947/achieve-0-50776-on-the-leaderboard-in-a-minute-with-xgboost/68308#post68308).

Build with HDFS and S3 Support
=====
* To build xgboost with HDFS/S3 support and distributed learning, it is recommended to build with dmlc, with the following steps
  - ```git clone https://github.com/dmlc/dmlc-core```
  - Follow the instructions in dmlc-core/make/config.mk to compile libdmlc.a
  - In the root folder of xgboost, type ```make dmlc=dmlc-core```
* This will allow xgboost to directly load data from and save models to HDFS and S3
  - Simply prefix the filename with s3:// or hdfs:// (see the sketch after this list)
* This xgboost can then be used for distributed learning
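As a rough sketch of what this enables (the host, paths, and bucket below are placeholders), the Python wrapper can then read and write remote URIs directly:

```python
import xgboost as xgb

# Works once xgboost is built with dmlc support (make dmlc=dmlc-core);
# all URIs below are hypothetical.
dtrain = xgb.DMatrix("hdfs://namenode:9000/user/me/train.libsvm")
bst = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=10)
bst.save_model("s3://my-bucket/models/xgb.model")
```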
doc/param_tuning.md (new file)
@@ -0,0 +1,45 @@
Notes on Parameter Tuning
====
Parameter tuning is a dark art in machine learning; the optimal parameters
of a model can depend on many scenarios, so it is impossible to create a
comprehensive guide for doing so.

This document tries to provide some guidelines for the parameters in xgboost.


Understanding Bias-Variance Tradeoff
====
If you take a machine learning or statistics course, this is likely to be one
of the most important concepts.
When we allow the model to get more complicated (e.g. more depth), it has a
better ability to fit the training data, resulting in a less biased model.
However, such a complicated model requires more data to fit.

Most of the parameters in xgboost are about the bias-variance tradeoff. The best model
should trade model complexity off against its predictive power carefully.
The [Parameters Documentation](parameter.md) will tell you whether each parameter
makes the model more conservative or not. This can be used to help you
turn the knob between a complicated model and a simple model.
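As a concrete illustration of the tradeoff (the data, split, and parameter values below are synthetic and arbitrary), this sketch trains at several depths so you can watch the gap between train and test error widen once the model starts fitting noise:

```python
import numpy as np
import xgboost as xgb

# Noisy synthetic data, purely for illustration.
rng = np.random.RandomState(0)
X = rng.rand(2000, 10)
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.3, size=2000) > 1.0).astype(int)
dtrain = xgb.DMatrix(X[:1500], label=y[:1500])
dtest = xgb.DMatrix(X[1500:], label=y[1500:])

# Deeper trees fit the training data better (lower bias); if the gap
# between train and test error widens, variance is taking over.
for max_depth in (2, 6, 12):
    params = {"objective": "binary:logistic", "max_depth": max_depth,
              "eval_metric": "error"}
    xgb.train(params, dtrain, num_boost_round=50,
              evals=[(dtrain, "train"), (dtest, "test")])
```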

Control Overfitting
====
When you observe high training accuracy but low test accuracy,
it is likely that you have encountered an overfitting problem.

There are in general two ways that you can control overfitting in xgboost (a sketch follows this list):
* The first way is to directly control model complexity
  - This includes ```max_depth```, ```min_child_weight``` and ```gamma```
* The second way is to add randomness to make training robust to noise
  - This includes ```subsample``` and ```colsample_bytree```
  - You can also reduce the stepsize ```eta```, but remember to increase ```num_round``` when you do so.
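A minimal sketch of what these knobs look like in the Python interface; the specific values are arbitrary, and `dtrain` is assumed to be an existing `DMatrix`:

```python
import xgboost as xgb

params = {
    "objective": "binary:logistic",
    # direct complexity control: shallower, more conservative trees
    "max_depth": 4,
    "min_child_weight": 10,
    "gamma": 1.0,
    # randomness, to make training robust to noise
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    # smaller stepsize, compensated by more rounds below
    "eta": 0.05,
}
# bst = xgb.train(params, dtrain, num_boost_round=500)
```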

Handle Imbalanced Dataset
===
For common cases such as ads clickthrough logs, the dataset is extremely imbalanced.
This can affect the training of the xgboost model, and there are two ways to improve it (a sketch of both follows this list).
* If you care only about the ranking order (AUC) of your prediction
  - Balance the positive and negative weights via ```scale_pos_weight```
  - Use AUC for evaluation
* If you care about predicting the right probability
  - In such a case, you cannot re-balance the dataset
  - Instead, setting the parameter ```max_delta_step``` to a finite number (say 1) will help convergence
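A sketch of both cases in the Python interface; the class counts are hypothetical and `dtrain` is assumed to exist:

```python
import xgboost as xgb

# Hypothetical counts for an imbalanced clickthrough-style dataset.
num_pos, num_neg = 1000, 99000

# Case 1: only the ranking order (AUC) matters -- re-weight the classes.
ranking_params = {
    "objective": "binary:logistic",
    "scale_pos_weight": float(num_neg) / num_pos,  # balance pos/neg weight
    "eval_metric": "auc",                          # evaluate with AUC
}

# Case 2: calibrated probabilities matter -- do not re-balance;
# cap each tree's output update instead to help convergence.
probability_params = {
    "objective": "binary:logistic",
    "max_delta_step": 1,
}
# bst = xgb.train(ranking_params, dtrain, num_boost_round=100)
```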
@@ -136,6 +136,7 @@ class SparsePage {
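    // rebase the page-relative offsets loaded from disk into the
    // in-memory offset array, starting at position `begin`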
    for (size_t i = 0; i < disk_offset_.size(); ++i) {
      offset[i + begin] = top + disk_offset_[i];
    }
    return true;
  }
  /*!
   * \brief Push row batch into the page
@@ -4,7 +4,7 @@ This folder provides wrapper of xgboost to other languages

Python
=====
* To make the python module, type ```./build.sh``` in the root directory of the project
* Install with `python setup.py install` from this directory.
* Refer also to the walk-through example in the [demo folder](../demo/guide-python); a minimal usage sketch follows.
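A minimal end-to-end sketch once the module is installed, using the mushroom demo data that ships with the repository (the relative paths assume you run it from this folder):

```python
import xgboost as xgb

# Train a tiny model on the bundled agaricus (mushroom) data.
dtrain = xgb.DMatrix("../demo/data/agaricus.txt.train")
dtest = xgb.DMatrix("../demo/data/agaricus.txt.test")
params = {"objective": "binary:logistic", "max_depth": 2, "eta": 1.0}
bst = xgb.train(params, dtrain, num_boost_round=2,
                evals=[(dtest, "eval"), (dtrain, "train")])
preds = bst.predict(dtest)
```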