diff --git a/doc/README.md b/doc/README.md
index a75d256c4..a56d573e5 100644
--- a/doc/README.md
+++ b/doc/README.md
@@ -6,6 +6,7 @@ List of Documentations
 * [External Memory Version](external_memory.md)
 * [Text input format](input_format.md)
 * [Build Instruction](build.md)
+* [Notes on Parameter Tuning](param_tuning.md)
 * [Notes on the Code](../src)
 * List of all parameters and their usage: [Parameters](parameter.md)
 * Learning about the model: [Introduction to Boosted Trees](http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf)
diff --git a/doc/param_tuning.md b/doc/param_tuning.md
new file mode 100644
index 000000000..cb4ce11f8
--- /dev/null
+++ b/doc/param_tuning.md
@@ -0,0 +1,45 @@
+Notes on Parameter Tuning
+====
+Parameter tuning is a dark art in machine learning; the optimal parameters
+of a model can depend on many scenarios, so it is impossible to create a
+comprehensive guide for doing so.
+
+This document tries to provide some guidelines for the parameters in xgboost.
+
+
+Understanding Bias-Variance Tradeoff
+====
+If you have taken a machine learning or statistics course, this is likely to be one
+of the most important concepts.
+When we allow the model to get more complicated (e.g. more depth), the model
+has a better ability to fit the training data, resulting in a less biased model.
+However, such a complicated model requires more data to fit.
+
+Most of the parameters in xgboost are about the bias-variance tradeoff. The best model
+should carefully trade model complexity against its predictive power.
+The [Parameters Documentation](parameter.md) will tell you whether each parameter
+makes the model more conservative or not. This can be used to help you
+turn the knob between a complicated model and a simple model.
+
+Control Overfitting
+====
+When you observe high training accuracy but low test accuracy,
+it is likely that you have encountered an overfitting problem.
+
+In general, there are two ways you can control overfitting in xgboost:
+* The first way is to directly control model complexity
+  - This includes ```max_depth```, ```min_child_weight``` and ```gamma```
+* The second way is to add randomness to make training robust to noise
+  - This includes ```subsample``` and ```colsample_bytree```
+  - You can also reduce the stepsize ```eta```, but remember to increase ```num_round``` when you do so.
+
+Handle Imbalanced Dataset
+====
+For common cases such as ads clickthrough logs, the dataset is extremely imbalanced.
+This can affect the training of the xgboost model, and there are two ways to improve it.
+* If you care only about the ranking order (AUC) of your prediction
+  - Balance the positive and negative weights via ```scale_pos_weight```
+  - Use AUC for evaluation
+* If you care about predicting the right probability
+  - In such a case, you cannot re-balance the dataset
+  - Instead, set the parameter ```max_delta_step``` to a finite number (say 1); this will help convergence
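+
+For illustration, below is a minimal sketch of how the controls from the
+Control Overfitting section might be combined, assuming the xgboost Python
+package (```xgb.DMatrix``` and ```xgb.train```) and scikit-learn only for a toy
+dataset; the parameter values and number of rounds are placeholders, not recommendations:
+
+```python
+import xgboost as xgb
+from sklearn.datasets import make_classification
+from sklearn.model_selection import train_test_split
+
+# Toy binary classification data, split into train/validation sets
+X, y = make_classification(n_samples=10000, random_state=0)
+X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
+dtrain = xgb.DMatrix(X_train, label=y_train)
+dvalid = xgb.DMatrix(X_valid, label=y_valid)
+
+params = {
+    'objective': 'binary:logistic',
+    # Directly control model complexity
+    'max_depth': 4,
+    'min_child_weight': 10,
+    'gamma': 1.0,
+    # Add randomness to make training robust to noise
+    'subsample': 0.8,
+    'colsample_bytree': 0.8,
+    # Smaller stepsize, compensated by a larger number of boosting rounds
+    'eta': 0.05,
+}
+bst = xgb.train(params, dtrain, num_boost_round=500,
+                evals=[(dvalid, 'valid')])
+```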
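+
+In the same spirit, the two imbalanced-data options described above can be
+sketched as follows, reusing ```params```, ```dtrain``` and ```dvalid``` from the
+previous example. Setting ```scale_pos_weight``` to the ratio of negative to
+positive examples is one common choice, not a rule:
+
+```python
+import numpy as np
+
+# Ratio of negative to positive training examples
+ratio = float(np.sum(y_train == 0)) / np.sum(y_train == 1)
+
+# Option 1: you only care about the ranking order (AUC)
+ranking_params = dict(params)
+ranking_params.update({
+    'scale_pos_weight': ratio,  # balance the positive and negative weights
+    'eval_metric': 'auc',       # evaluate the ranking order
+})
+bst_rank = xgb.train(ranking_params, dtrain, num_boost_round=500,
+                     evals=[(dvalid, 'valid')])
+
+# Option 2: you care about predicting the right probability, so do not re-balance
+probability_params = dict(params)
+probability_params['max_delta_step'] = 1  # helps convergence
+bst_prob = xgb.train(probability_params, dtrain, num_boost_round=500,
+                     evals=[(dvalid, 'valid')])
+```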