Merge remote branch 'src/master' into custom_loss_cv_fix

This commit is contained in:
Vadim Khotilovich 2015-05-01 15:42:50 -05:00
commit f05c7d87cb
7 changed files with 110 additions and 49 deletions

View File

@ -22,8 +22,9 @@ Highlights of Usecases: [Highlight Links](doc/README.md#highlight-links)
What's New
==========
* [External Memory Version](doc/external_memory.md)
* XGBoost wins [WWW2015 Microsoft Malware Classification Challenge (BIG 2015)](http://www.kaggle.com/c/malware-classification/forums/t/13490/say-no-to-overfitting-approaches-sharing)
- Check out the winning solution at [Highlight links](doc/README.md#highlight-links)
* [External Memory Version](doc/external_memory.md)
* XGBoost now supports HDFS and S3
* [Distributed XGBoost now runs on YARN](https://github.com/dmlc/wormhole/tree/master/learn/xgboost)
* [xgboost user group](https://groups.google.com/forum/#!forum/xgboost-user/) for tracking changes, sharing your experience on xgboost
@ -48,50 +49,10 @@ Features
- It inherits all the optimizations made in the single-machine mode, and maximally utilizes the resources using both multi-threading and distributed computing.
Build
=====
=======
* Run ```bash build.sh``` (you can also type make)
* If you have a C++11 compiler, it is recommended to type ```make cxx11=1```
- C++11 is not used by default
* If your compiler does not come with OpenMP support, it will fire a warning telling you that the code will compile into single-thread mode, and you will get a single-thread xgboost
* You may get an error: -lgomp is not found
- You can type ```make no_omp=1```, this will get you a single-thread xgboost
- Alternatively, you can upgrade your compiler to compile the multi-thread version
* Windows (VS 2010): see [windows](windows) folder
- In principle, you add all the cpp files listed in the Makefile to the project, and build
* OS X:
- For users who want OpenMP support using [Homebrew](http://brew.sh/), run ```brew update``` (this ensures that you install gcc-4.9 or above) and ```brew install gcc --without-multilib```. Once it is installed, edit [Makefile](Makefile) by replacing:
```
export CC = gcc
export CXX = g++
```
with
```
export CC = gcc-4.9
export CXX = g++-4.9
```
Then run ```bash build.sh``` normally.
- For users who want to use [High Performance Computing for Mac OS X](http://hpc.sourceforge.net/), download the GCC 4.9 binary tarball and follow the installation guide to install it under `/usr/local`. Then edit [Makefile](Makefile) by replacing:
```
export CC = gcc
export CXX = g++
```
with
```
export CC = /usr/local/bin/gcc
export CXX = /usr/local/bin/g++
```
Then run ```bash build.sh``` normally. This solution is given by [Phil Culliton](https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/12947/achieve-0-50776-on-the-leaderboard-in-a-minute-with-xgboost/68308#post68308).
Build with HDFS and S3 Support
=====
* To build xgboost with HDFS/S3 support and distributed learning, it is recommended to build with dmlc, using the following steps
- ```git clone https://github.com/dmlc/dmlc-core```
- Follow the instructions in dmlc-core/make/config.mk to compile libdmlc.a
- In the root folder of xgboost, type ```make dmlc=dmlc-core```
* This allows xgboost to directly load data and save models from/to HDFS and S3
- Simply prefix the filename with s3:// or hdfs://
* The resulting xgboost can be used for distributed learning
- Normally this gives what you want
- See [Build Instruction](doc/build.md) for more information
Version
=======

View File

@ -4,7 +4,8 @@
# This will automatically make xgboost for Mac users who don't have OpenMP support.
# In most cases, type make will give what you want.
# download rabit
# See additional instruction in doc/build.md
if make; then
echo "Successfully build multi-thread xgboost"
@ -15,4 +16,6 @@ else
make clean
make no_omp=1
echo "Successfully build single-thread xgboost"
echo "If you want multi-threaded version"
echo "See additional instructions in doc/build.md"
fi

View File

@ -5,6 +5,8 @@ List of Documentations
* [Learning to use xgboost by example](../demo)
* [External Memory Version](external_memory.md)
* [Text input format](input_format.md)
* [Build Instruction](build.md)
* [Notes on Parameter Tuning](param_tuning.md)
* [Notes on the Code](../src)
* List of all parameters and their usage: [Parameters](parameter.md)
* Learning about the model: [Introduction to Boosted Trees](http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf)
@ -18,10 +20,11 @@ How to get started
Highlight Links
====
This section is about blog posts, presentations and videos discussing how to use xgboost to solve your interesting problem. If you think something belongs here, send a pull request.
* [Winning solution of Kaggle Higgs competition: what a single model can do](http://no2147483647.wordpress.com/2014/09/17/winning-solution-of-kaggle-higgs-competition-what-a-single-model-can-do/)
* Video tutorial: [Better Optimization with Repeated Cross Validation and the XGBoost model](https://www.youtube.com/watch?v=Og7CGAfSr_Y)
* [Feature Importance Analysis with XGBoost in Tax audit](http://fr.slideshare.net/MichaelBENESTY/feature-importance-analysis-with-xgboost-in-tax-audit)
* [Kaggle Malware Prediction winning solution](https://github.com/xiaozhouwang/kaggle_Microsoft_Malware)
* [Kaggle Tradeshift winning solution by daxiongshu](https://github.com/daxiongshu/kaggle-tradeshift-winning-solution)
* [Feature Importance Analysis with XGBoost in Tax audit](http://fr.slideshare.net/MichaelBENESTY/feature-importance-analysis-with-xgboost-in-tax-audit)
* Video tutorial: [Better Optimization with Repeated Cross Validation and the XGBoost model](https://www.youtube.com/watch?v=Og7CGAfSr_Y)
* [Winning solution of Kaggle Higgs competition: what a single model can do](http://no2147483647.wordpress.com/2014/09/17/winning-solution-of-kaggle-higgs-competition-what-a-single-model-can-do/)
Contribution
====

48
doc/build.md Normal file
View File

@ -0,0 +1,48 @@
Build XGBoost
====
* Run ```bash build.sh``` (you can also type make)
* If you have a C++11 compiler, it is recommended to type ```make cxx11=1```
- C++11 is not used by default
* If your compiler does not come with OpenMP support, it will fire a warning telling you that the code will compile into single-thread mode, and you will get a single-thread xgboost
* You may get an error: -lgomp is not found
- You can type ```make no_omp=1```, this will get you a single-thread xgboost
- Alternatively, you can upgrade your compiler to compile the multi-thread version
* Windows (VS 2010): see [../windows](../windows) folder
- In principle, you add all the cpp files listed in the Makefile to the project, and build
* OS X with multi-threading support: see [next section](#openmp-for-os-x)
OpenMP for OS X
====
* For users who want OpenMP support using [Homebrew](http://brew.sh/), run ```brew update``` (this ensures that you install gcc-4.9 or above) and ```brew install gcc --without-multilib```. Once it is installed, edit [../Makefile](../Makefile) by replacing:
```bash
export CC = gcc
export CXX = g++
```
with
```bash
export CC = gcc-4.9
export CXX = g++-4.9
```
Then run ```bash build.sh``` normally.
* For users who want to use [High Performance Computing for Mac OS X](http://hpc.sourceforge.net/), download the GCC 4.9 binary tarball and follow the installation guide to install it under `/usr/local`. Then edit [../Makefile](../Makefile) by replacing:
```bash
export CC = gcc
export CXX = g++
```
with
```bash
export CC = /usr/local/bin/gcc
export CXX = /usr/local/bin/g++
```
Then run ```bash build.sh``` normally. This solution is given by [Phil Culliton](https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/12947/achieve-0-50776-on-the-leaderboard-in-a-minute-with-xgboost/68308#post68308).
Build with HDFS and S3 Support
=====
* To build xgboost with HDFS/S3 support and distributed learning, it is recommended to build with dmlc, using the following steps
- ```git clone https://github.com/dmlc/dmlc-core```
- Follow the instructions in dmlc-core/make/config.mk to compile libdmlc.a
- In the root folder of xgboost, type ```make dmlc=dmlc-core```
* This allows xgboost to directly load data and save models from/to HDFS and S3
- Simply prefix the filename with s3:// or hdfs:// (see the sketch below)
* The resulting xgboost can be used for distributed learning
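
Below is a minimal Python sketch, not part of the original instructions, of what this looks like from the Python wrapper, assuming xgboost was built with ```make dmlc=dmlc-core``` and the wrapper is installed; the HDFS and S3 paths are hypothetical placeholders.
```python
import xgboost as xgb

# Assumes xgboost was built with dmlc support (make dmlc=dmlc-core) and the
# Python wrapper is installed. The hdfs:// and s3:// paths below are
# hypothetical placeholders for your own cluster and bucket.
dtrain = xgb.DMatrix("hdfs://namenode:9000/data/train.libsvm")

params = {"objective": "binary:logistic", "max_depth": 6, "eta": 0.3}
bst = xgb.train(params, dtrain, num_boost_round=10)

# Saving a model works the same way: just prefix the output path.
bst.save_model("s3://my-bucket/models/xgboost.model")
```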

45
doc/param_tuning.md Normal file
View File

@ -0,0 +1,45 @@
Notes on Parameter Tuning
====
Parameter tuning is a dark art in machine learning; the optimal parameters
of a model can depend on many scenarios, so it is impossible to create a
comprehensive guide for doing so.
This document tries to provide some guidelines for tuning parameters in xgboost.
Understanding Bias-Variance Tradeoff
====
If you take a machine learning or statistics course, this is likely to be one
of the most important concepts.
When we allow the model to get more complicated (e.g. more depth), the model
has a better ability to fit the training data, resulting in a less biased model.
However, such a complicated model requires more data to fit.
Most of the parameters in xgboost are about the bias-variance tradeoff. The best model
should trade off model complexity against its predictive power carefully.
[Parameters Documentation](parameter.md) will tell you whether each parameter
will make the model more conservative or not. This can be used to help you
turn the knob between a complicated model and a simple model.
Control Overfitting
====
When you observe high training accuracy but low test accuracy,
it is likely that you have encountered an overfitting problem.
There are in general two ways that you can control overfitting in xgboost:
* The first way is to directly control model complexity
- This includes ```max_depth```, ```min_child_weight``` and ```gamma```
* The second way is to add randomness to make training robust to noise
- This includes ```subsample``` and ```colsample_bytree```
- You can also reduce the stepsize ```eta```, but remember to increase ```num_round``` when you do so (see the sketch after this list)
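
As a rough illustration, not taken from the original document, the sketch below shows how these knobs might appear together in a Python parameter dictionary; the values are arbitrary and the tiny synthetic dataset only exists so the snippet runs.
```python
import numpy as np
import xgboost as xgb

# Tiny synthetic dataset, only here so the snippet runs end to end.
X = np.random.rand(500, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    # 1) Directly control model complexity.
    "max_depth": 4,           # shallower trees give a simpler model
    "min_child_weight": 10,   # require more instance weight in each leaf
    "gamma": 1.0,             # minimum loss reduction required to make a split
    # 2) Add randomness to make training robust to noise.
    "subsample": 0.8,         # use 80% of rows for each tree
    "colsample_bytree": 0.8,  # use 80% of columns for each tree
    "eta": 0.1,               # smaller stepsize ...
}
num_round = 200               # ... compensated by more boosting rounds

bst = xgb.train(params, dtrain, num_round)
```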
Handle Imbalanced Dataset
====
For common cases such as ads clickthrough logs, the dataset is extremely imbalanced.
This can affect the training of the xgboost model, and there are two ways to improve it.
* If you care only about the ranking order (AUC) of your prediction
- Balance the positive and negative weights via ```scale_pos_weight```
- Use AUC for evaluation
* If you care about predicting the right probability
- In such a case, you cannot re-balance the dataset
- In such a case, setting the parameter ```max_delta_step``` to a finite number (say 1) will help convergence (see the sketch after this list)
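
The sketch below, not part of the original guide, illustrates both options on a synthetic imbalanced dataset; using the negative/positive ratio for ```scale_pos_weight``` is a common heuristic rather than a rule stated here.
```python
import numpy as np
import xgboost as xgb

# Synthetic, heavily imbalanced binary dataset: 100 positives vs 4900 negatives.
X = np.random.rand(5000, 10)
y = np.zeros(5000, dtype=int)
y[:100] = 1
np.random.shuffle(y)
dtrain = xgb.DMatrix(X, label=y)

# Option 1: you only care about ranking order -> re-weight positives and evaluate AUC.
ranking_params = {
    "objective": "binary:logistic",
    "scale_pos_weight": float((y == 0).sum()) / (y == 1).sum(),  # negative/positive ratio heuristic
    "eval_metric": "auc",
}
bst_rank = xgb.train(ranking_params, dtrain, num_boost_round=20)

# Option 2: you care about calibrated probabilities -> keep the data as is and
# cap the per-tree update with max_delta_step to help convergence.
probability_params = {
    "objective": "binary:logistic",
    "max_delta_step": 1,
}
bst_prob = xgb.train(probability_params, dtrain, num_boost_round=20)
```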

View File

@ -136,6 +136,7 @@ class SparsePage {
for (size_t i = 0; i < disk_offset_.size(); ++i) {
offset[i + begin] = top + disk_offset_[i];
}
return true;
}
/*!
* \brief Push row batch into the page

View File

@ -4,7 +4,7 @@ This folder provides wrapper of xgboost to other languages
Python
=====
* To make the python module, type ```make``` in the root directory of the project
* To make the python module, type ```./build.sh``` in the root directory of the project
* Install with `python setup.py install` from this directory.
* Refer also to the walk-through example in [demo folder](../demo/guide-python); a minimal usage sketch is shown below
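
For a quick sanity check after installation, here is a minimal usage sketch, not part of this README, on a tiny random dataset.
```python
import numpy as np
import xgboost as xgb

# Tiny random dataset, only to verify the installed module works.
X = np.random.rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.3}
bst = xgb.train(params, dtrain, num_boost_round=10)

preds = bst.predict(dtrain)  # predicted probabilities for the training rows
print(preds[:5])
```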