
#########################
Notes on Parameter Tuning
#########################
Parameter tuning is a dark art in machine learning; the optimal parameters
of a model can depend on many scenarios, so it is impossible to create a
comprehensive guide for doing so.
This document tries to provide some guidelines for the parameters in XGBoost.

************************************
Understanding Bias-Variance Tradeoff
************************************
If you take a machine learning or statistics course, this is likely to be one
of the most important concepts.
When we allow the model to get more complicated (e.g. more depth), the model
has a better ability to fit the training data, resulting in a less biased model.
However, such a complicated model requires more data to fit.

Most of the parameters in XGBoost are about the bias-variance tradeoff. The best
model should carefully trade model complexity against its predictive power.
The :doc:`Parameters Documentation </parameter>` will tell you whether each parameter
will make the model more conservative or not. This can be used to help you
turn the knob between a complicated model and a simple model.
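As a rough illustration (assuming the Python package and an existing
``xgboost.DMatrix`` named ``dtrain``; the data and the exact values below are
placeholders), the same training call can be pushed toward either end of the
tradeoff:

.. code-block:: python

    import xgboost as xgb

    # Assumed to exist: dtrain is an xgb.DMatrix holding the training data.

    # A deeper, less constrained model: lower bias, higher variance.
    complex_params = {
        "max_depth": 8,          # deeper trees fit the training data more closely
        "min_child_weight": 1,   # small value allows very specific splits
        "gamma": 0,              # no minimum loss reduction required to split
    }

    # A shallower, more conservative model: higher bias, lower variance.
    conservative_params = {
        "max_depth": 3,
        "min_child_weight": 10,
        "gamma": 1.0,
    }

    bst_complex = xgb.train(complex_params, dtrain, num_boost_round=100)
    bst_conservative = xgb.train(conservative_params, dtrain, num_boost_round=100)

Comparing the two boosters on a held-out set is the usual way to see which side
of the tradeoff your data favors.
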
*******************
Control Overfitting
*******************
When you observe high training accuracy but low test accuracy, it is likely that you have encountered an overfitting problem.

There are in general two ways that you can control overfitting in XGBoost
(a short sketch combining both follows the list):

* The first way is to directly control model complexity.

  - This includes ``max_depth``, ``min_child_weight`` and ``gamma``.

* The second way is to add randomness to make training robust to noise.

  - This includes ``subsample`` and ``colsample_bytree``.
  - You can also reduce the stepsize ``eta``. Remember to increase ``num_round`` when you do so.
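A minimal sketch of both approaches combined (assuming the Python package with
an existing training ``DMatrix`` ``dtrain`` and validation ``DMatrix``
``dvalid``; all values are illustrative):

.. code-block:: python

    import xgboost as xgb

    # Assumed to exist: dtrain and dvalid are xgb.DMatrix objects.
    params = {
        # 1) Directly control model complexity.
        "max_depth": 4,
        "min_child_weight": 5,
        "gamma": 0.5,
        # 2) Add randomness to make training robust to noise.
        "subsample": 0.8,
        "colsample_bytree": 0.8,
        # Smaller step size; compensated by more boosting rounds below.
        "eta": 0.05,
    }

    bst = xgb.train(
        params,
        dtrain,
        num_boost_round=1000,             # increased because eta was reduced
        evals=[(dtrain, "train"), (dvalid, "valid")],
        early_stopping_rounds=50,         # stop when validation stops improving
    )
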
*************************
Handle Imbalanced Dataset
*************************
For common cases such as ad clickthrough logs, the dataset is extremely imbalanced.
This can affect the training of the XGBoost model, and there are two ways to improve it
(a short sketch of both follows the list):

* If you care only about the overall performance metric (AUC) of your prediction:

  - Balance the positive and negative weights via ``scale_pos_weight``.
  - Use AUC for evaluation.

* If you care about predicting the right probability:

  - In such a case, you cannot re-balance the dataset.
  - Set the parameter ``max_delta_step`` to a finite number (say 1) to help convergence.
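A rough sketch of the two options (assuming the Python package, a 0/1 label
array ``y_train`` and a matching ``DMatrix`` ``dtrain``; the variable names are
illustrative only):

.. code-block:: python

    import numpy as np
    import xgboost as xgb

    # Assumed to exist: y_train is a 0/1 label array, dtrain is an xgb.DMatrix.

    # Option 1: ranking quality (AUC) is what matters -> re-weight the positives.
    ratio = float(np.sum(y_train == 0)) / np.sum(y_train == 1)
    auc_params = {
        "objective": "binary:logistic",
        "scale_pos_weight": ratio,   # balance positive and negative weights
        "eval_metric": "auc",        # evaluate with AUC
    }

    # Option 2: calibrated probabilities matter -> do not re-balance,
    # but cap the update step to help convergence.
    prob_params = {
        "objective": "binary:logistic",
        "max_delta_step": 1,
        "eval_metric": "logloss",
    }

    bst_auc = xgb.train(auc_params, dtrain, num_boost_round=100)
    bst_prob = xgb.train(prob_params, dtrain, num_boost_round=100)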