[DOC] refactor doc
145
doc/how_to/contribute.md
Normal file
@@ -0,0 +1,145 @@
Contribute to XGBoost
=====================
XGBoost has been developed and used by a group of active community members.
Everyone is more than welcome to contribute. It is a way to make the project better and more accessible to more users.

- Please add your name to [CONTRIBUTORS.md](../CONTRIBUTORS.md) after your patch has been merged.
- Please also update [NEWS.md](../NEWS.md) to add a note when you change the API or add a new document.

Guidelines
----------
* [Submit Pull Request](#submit-pull-request)
* [Git Workflow Howtos](#git-workflow-howtos)
  - [How to resolve conflict with master](#how-to-resolve-conflict-with-master)
  - [How to combine multiple commits into one](#how-to-combine-multiple-commits-into-one)
  - [What is the consequence of force push](#what-is-the-consequence-of-force-push)
* [Documents](#documents)
* [Testcases](#testcases)
* [Examples](#examples)
* [Core Library](#core-library)
* [Python Package](#python-package)
* [R Package](#r-package)

Submit Pull Request
-------------------
* Before submitting, please rebase your code on the most recent version of master. You can do it by
  ```bash
  git remote add upstream https://github.com/dmlc/xgboost
  git fetch upstream
  git rebase upstream/master
  ```
* If you have multiple small commits,
  it might be good to merge them together (use git rebase then squash) into more meaningful groups.
* Send the pull request!
  - Fix the problems reported by automatic checks.
  - If you are contributing a new module, consider adding a testcase in [tests](../tests).

Git Workflow Howtos
-------------------
### How to resolve conflict with master
- First rebase to the most recent master:
  ```bash
  # The first two steps can be skipped after you do them once.
  git remote add upstream https://github.com/dmlc/xgboost
  git fetch upstream
  git rebase upstream/master
  ```
- Git may show some conflicts it cannot merge, say ```conflicted.py```.
- Manually modify the file to resolve the conflict.
- After you have resolved the conflict, mark it as resolved by
  ```bash
  git add conflicted.py
  ```
- Then you can continue the rebase by
  ```bash
  git rebase --continue
  ```
- Finally push to your fork; you may need to force push here.
  ```bash
  git push --force
  ```

### How to combine multiple commits into one
Sometimes we want to combine multiple commits, especially when later commits are only fixes to previous ones,
to create a PR with a set of meaningful commits. You can do it by the following steps.
- Before doing so, configure the default editor of git if you haven't done so.
  ```bash
  git config core.editor the-editor-you-like
  ```
- Assume we want to merge the last 3 commits; type the following commands:
  ```bash
  git rebase -i HEAD~3
  ```
- A text editor will pop up. Set the first commit to ```pick```, and change the later ones to ```squash```.
- After you save the file, another text editor will pop up and ask you to modify the combined commit message.
- Push the changes to your fork. You need to force push:
  ```bash
  git push --force
  ```

### What is the consequence of force push
The previous two tips require a force push because we altered the path of the commits.
It is fine to force push to your own fork, as long as the commits changed are only yours.

Documents
---------
* The documents are created using sphinx and [recommonmark](http://recommonmark.readthedocs.org/en/latest/).
* You can build the documents locally to see the effect.

Testcases
---------
* All the testcases are in [tests](../tests).
* We use python nose for python test cases.

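As a minimal sketch of what a nose-style testcase looks like: nose collects functions whose names start with ```test_``` in files named ```test_*.py``` and reports a failure for every assert that does not hold. The function and helper below are made-up illustrations, not taken from the actual test suite.

```python
def approx_equal(a, b, tol=1e-6):
    """Tiny hypothetical helper exercised by the test below."""
    return abs(a - b) < tol


def test_approx_equal():
    # Plain asserts are all nose needs; no class boilerplate is required.
    assert approx_equal(0.1 + 0.2, 0.3)
    assert not approx_equal(1.0, 2.0)
```

Running ```nosetests``` in the containing folder would discover and execute the test automatically.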
Examples
--------
* Usecases and examples are in [demo](../demo).
* We are super excited to hear about your story. If you have blogposts,
  tutorials, or code solutions using xgboost, please tell us, and we will add
  a link in the example pages.

Core Library
------------
- Follow the Google C++ Style guide for C++ code.
- We use doxygen to document all the interface code.
- You can reproduce the linter checks by typing ```make lint```.

Python Package
--------------
- Always add a docstring to new functions in the numpydoc format.
- You can reproduce the linter checks by typing ```make lint```.

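A minimal sketch of a numpydoc-format docstring, with the standard ```Parameters``` and ```Returns``` sections. The function itself is a made-up illustration, not part of the package.

```python
def clip_margin(value, lower, upper):
    """Clip a margin value into a closed interval.

    Parameters
    ----------
    value : float
        The raw margin value.
    lower : float
        Lower bound of the interval.
    upper : float
        Upper bound of the interval.

    Returns
    -------
    float
        ``value`` clipped to ``[lower, upper]``.
    """
    return max(lower, min(value, upper))
```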
R Package
---------
### Code Style
- We follow Google's C++ Style guide for C++ code.
  - This is mainly to be consistent with the rest of the project.
  - Another reason is that we will be able to check style automatically with a linter.
- You can check the style of the code by typing the following command at the root folder.
  ```bash
  make rcpplint
  ```
- When needed, you can disable the linter warning on a certain line with ```// NOLINT(*)``` comments.

### Rmarkdown Vignettes
Rmarkdown vignettes are placed in [R-package/vignettes](../R-package/vignettes).
These Rmarkdown files are not compiled. We host the compiled version on [doc/R-package](R-package).

The following steps are needed to add a new Rmarkdown vignette:
- Add the original rmarkdown to ```R-package/vignettes```.
- Modify ```doc/R-package/Makefile``` to add the markdown files to be built.
- Clone the [dmlc/web-data](https://github.com/dmlc/web-data) repo to the folder ```doc```.
- Now type the following command in ```doc/R-package```:
  ```bash
  make the-markdown-to-make.md
  ```
- This will generate the markdown, as well as the figures, into ```doc/web-data/xgboost/knitr```.
- Modify ```doc/R-package/index.md``` to point to the generated markdown.
- Add the generated figures to the ```dmlc/web-data``` repo.
  - If you already cloned the repo to doc, this means a ```git add```.
- Create a PR for both the markdown and ```dmlc/web-data```.
- You can also build the documents locally by typing the following command at ```doc```:
  ```bash
  make html
  ```
The reason we do this is to avoid a bloated repo size due to generated image sizes.

42
doc/how_to/external_memory.md
Normal file
@@ -0,0 +1,42 @@
Using XGBoost External Memory Version (beta)
============================================
There is no big difference between using the external memory version and the in-memory version.
The only difference is the filename format.

The external memory version takes the following filename format:
```
filename#cacheprefix
```

The ```filename``` is the normal path to the libsvm file you want to load in, and ```cacheprefix``` is a
path to a cache file that xgboost will use for the external memory cache.

The following code was extracted from [../demo/guide-python/external_memory.py](../demo/guide-python/external_memory.py):
```python
dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache')
```
You can see that there is an additional ```#dtrain.cache``` following the libsvm file; this is the name of the cache file.
For the CLI version, simply use ```"../data/agaricus.txt.train#dtrain.cache"``` as the filename.

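To make the format concrete, here is a hypothetical helper (not part of xgboost) that splits such a string the way the format describes:

```python
def split_cache_spec(spec):
    """Split a 'filename#cacheprefix' string into (filename, cacheprefix).

    If no '#' is present, the cache prefix is None, which corresponds to
    running fully in memory.
    """
    filename, sep, cacheprefix = spec.partition('#')
    return filename, (cacheprefix if sep else None)
```

For example, ```split_cache_spec('../data/agaricus.txt.train#dtrain.cache')``` yields the libsvm path and the ```dtrain.cache``` cache name separately.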
Performance Note
----------------
* The parameter ```nthread``` should be set to the number of ***real*** cores.
  - Most modern CPUs offer hyperthreading, which means a 4-core CPU can expose 8 threads.
  - Set nthread to 4 for maximum performance in such a case.

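A hedged sketch of picking such an ```nthread``` value: Python's standard library only reports logical CPUs, so halving the count is an assumption that holds for typical 2-way hyperthreading, not a universal rule.

```python
import os

# os.cpu_count() reports *logical* CPUs (hardware threads).
logical = os.cpu_count() or 1
# Assumption: 2-way hyperthreading, so real cores = logical // 2.
# On a 4-core/8-thread CPU this sets nthread to 4.
nthread = max(1, logical // 2)
```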
Distributed Version
-------------------
The external memory mode naturally works in the distributed version; you can simply set a path like
```
data = "hdfs:///path-to-data/#dtrain.cache"
```
xgboost will cache the data to the local disk. When you run on YARN, the current folder is temporary,
so you can directly use ```dtrain.cache``` to cache to the current folder.

Usage Note
----------
* This is an experimental version.
  - If you would like to try and test it, report results to https://github.com/dmlc/xgboost/issues/244.
* Currently only importing from the libsvm format is supported.
  - Contributions of ingestion from other common external memory data sources are welcome.

16
doc/how_to/index.md
Normal file
@@ -0,0 +1,16 @@
# XGBoost How To

This page contains guidelines to use and develop XGBoost.

## Installation
- [How to Install XGBoost](../build.md)

## Use XGBoost in Specific Ways
- [Parameter tuning guide](param_tuning.md)
- [Use out-of-core computation for large datasets](external_memory.md)

## Develop and Hack XGBoost
- [Contribute to XGBoost](contribute.md)

## Frequently Asked Questions
- [FAQ](../faq.md)

44
doc/how_to/param_tuning.md
Normal file
@@ -0,0 +1,44 @@
Notes on Parameter Tuning
=========================
Parameter tuning is a dark art in machine learning; the optimal parameters
of a model can depend on many scenarios, so it is impossible to create a
comprehensive guide for doing so.

This document tries to provide some guidelines for parameters in xgboost.

Understanding Bias-Variance Tradeoff
------------------------------------
If you have taken a machine learning or statistics course, this is likely to be one
of the most important concepts.
When we allow the model to get more complicated (e.g. more depth), the model
has a better ability to fit the training data, resulting in a less biased model.
However, such a complicated model requires more data to fit.

Most of the parameters in xgboost are about the bias-variance tradeoff. The best model
should trade model complexity against its predictive power carefully.
The [Parameters Documentation](parameter.md) will tell you whether each parameter
makes the model more conservative or not. This can be used to help you
turn the knob between a complicated model and a simple model.

Control Overfitting
-------------------
When you observe high training accuracy but low test accuracy, it is likely that you have encountered an overfitting problem.

There are in general two ways that you can control overfitting in xgboost:
* The first way is to directly control model complexity.
  - This includes ```max_depth```, ```min_child_weight``` and ```gamma```.
* The second way is to add randomness to make training robust to noise.
  - This includes ```subsample``` and ```colsample_bytree```.
  - You can also reduce the stepsize ```eta```, but remember to increase ```num_round``` when you do so.

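The knobs above can be collected into a parameter dictionary like the one below. The specific values are illustrative starting points only, not recommendations from this guide; tune them per dataset.

```python
# Illustrative settings only; tune per dataset.
params = {
    # Directly control model complexity.
    'max_depth': 4,            # shallower trees -> simpler model
    'min_child_weight': 5,     # larger value -> more conservative splits
    'gamma': 1.0,              # minimum loss reduction required to split
    # Add randomness to make training robust to noise.
    'subsample': 0.8,          # row subsampling per tree
    'colsample_bytree': 0.8,   # column subsampling per tree
    'eta': 0.05,               # smaller stepsize ...
}
num_round = 500                # ... paired with more boosting rounds
```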
Handle Imbalanced Dataset
-------------------------
For common cases such as ads clickthrough logs, the dataset is extremely imbalanced.
This can affect the training of the xgboost model, and there are two ways to improve it.
* If you care only about the ranking order (AUC) of your prediction:
  - Balance the positive and negative weights via ```scale_pos_weight```.
  - Use AUC for evaluation.
* If you care about predicting the right probability:
  - In such a case, you cannot re-balance the dataset.
  - In such a case, setting the parameter ```max_delta_step``` to a finite number (say 1) will help convergence.

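For the AUC case, a common heuristic (an assumption beyond what is stated above; the helper name is made up) is to set ```scale_pos_weight``` to the ratio of negative to positive examples:

```python
def suggest_scale_pos_weight(labels):
    """Ratio of negative to positive examples in a 0/1 label sequence.

    This implements the common sum(negative) / sum(positive) heuristic.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive examples in labels")
    return negatives / positives

# e.g. 90 negatives and 10 positives -> scale_pos_weight = 9.0
```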