add parameter tuning
This commit is contained in:
parent 6f0cbcaf2b
commit 2b3b55554f
@@ -6,6 +6,7 @@ List of Documentations
* [External Memory Version](external_memory.md)
* [Text input format](input_format.md)
* [Build Instruction](build.md)
* [Notes on Parameter Tuning](param_tuning.md)
* [Notes on the Code](../src)
* List of all parameters and their usage: [Parameters](parameter.md)
* Learning about the model: [Introduction to Boosted Trees](http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf)
doc/param_tuning.md (new file, 45 lines)
@@ -0,0 +1,45 @@
Notes on Parameter Tuning
====
Parameter tuning is a dark art in machine learning; the optimal parameters
of a model can depend on many scenarios, so it is impossible to create a
comprehensive guide for doing so.

This document tries to provide some guidelines for the parameters in xgboost.


Understanding Bias-Variance Tradeoff
====
If you take a machine learning or statistics course, this is likely to be one
of the most important concepts.
When we allow the model to get more complicated (e.g. more depth), the model
has a better ability to fit the training data, resulting in a less biased model.
However, such a complicated model requires more data to fit.

Most of the parameters in xgboost are about the bias-variance tradeoff. The best
model should carefully trade model complexity off against its predictive power.
The [Parameters Documentation](parameter.md) will tell you whether each parameter
makes the model more conservative or not. This can be used to help you
turn the knob between a complicated model and a simple model.
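For example, the minimal sketch below turns this knob by sweeping ```max_depth``` on toy synthetic data and logging training versus validation error; the data and parameter values are purely illustrative.

```python
import numpy as np
import xgboost as xgb

# Toy data for illustration only; substitute your own dataset.
rng = np.random.RandomState(0)
X = rng.randn(2000, 10)
y = (X[:, 0] + 0.5 * rng.randn(2000) > 0).astype(int)
dtrain = xgb.DMatrix(X[:1600], label=y[:1600])
dvalid = xgb.DMatrix(X[1600:], label=y[1600:])

# Deeper trees give a more complex, less biased model; watch for the point
# where validation error stops improving while training error keeps falling.
for max_depth in (2, 6, 10):
    params = {'objective': 'binary:logistic',
              'max_depth': max_depth,
              'eval_metric': 'error'}
    xgb.train(params, dtrain, num_boost_round=50,
              evals=[(dtrain, 'train'), (dvalid, 'valid')])
```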

Control Overfitting
====
When you observe high training accuracy but low test accuracy,
it is likely that you have encountered an overfitting problem.

There are in general two ways that you can control overfitting in xgboost; a sketch combining them follows this list.
* The first way is to directly control model complexity
  - This includes ```max_depth```, ```min_child_weight``` and ```gamma```
* The second way is to add randomness to make training robust to noise
  - This includes ```subsample``` and ```colsample_bytree```
  - You can also reduce the stepsize ```eta```, but remember to increase ```num_round``` when you do so.
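A minimal sketch combining both kinds of knobs; the values are illustrative starting points rather than recommendations, and ```dtrain``` is assumed to be a DMatrix like the one built in the earlier sketch.

```python
import xgboost as xgb

params = {
    'objective': 'binary:logistic',
    # 1) directly control model complexity
    'max_depth': 4,           # shallower trees
    'min_child_weight': 10,   # require more instance weight in each child
    'gamma': 1.0,             # minimum loss reduction required to split
    # 2) add randomness to make training robust to noise
    'subsample': 0.8,         # sample 80% of rows per tree
    'colsample_bytree': 0.8,  # sample 80% of columns per tree
    'eta': 0.05,              # reduced stepsize ...
}
# ... compensated by a larger number of boosting rounds
bst = xgb.train(params, dtrain, num_boost_round=500)
```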

Handle Imbalanced Dataset
====
For common cases such as ads clickthrough logs, the dataset is extremely imbalanced.
This can affect the training of the xgboost model, and there are two ways to improve it (see the sketch after this list).
* If you care only about the ranking order (AUC) of your prediction
  - Balance the positive and negative weights via ```scale_pos_weight```
  - Use AUC for evaluation
* If you care about predicting the right probability
  - In such a case, you cannot re-balance the dataset
  - Instead, setting the parameter ```max_delta_step``` to a finite number (say 1) will help convergence
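A sketch of both cases; the class counts are made up, and ```dtrain``` is assumed as before.

```python
import xgboost as xgb

# Case 1: only the ranking order (AUC) matters -> re-balance the classes.
n_neg, n_pos = 95000, 5000              # hypothetical counts of a skewed dataset
params_rank = {
    'objective': 'binary:logistic',
    'scale_pos_weight': n_neg / n_pos,  # weight positives up to restore balance
    'eval_metric': 'auc',               # evaluate by ranking quality
}
bst_rank = xgb.train(params_rank, dtrain, num_boost_round=100)

# Case 2: calibrated probabilities matter -> keep the natural class ratio and
# instead cap each leaf's weight estimate to help convergence.
params_prob = {
    'objective': 'binary:logistic',
    'max_delta_step': 1,
}
bst_prob = xgb.train(params_prob, dtrain, num_boost_round=100)
```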