[DOC] refactor doc

tqchen
2016-05-20 13:09:42 -07:00
parent 149589c583
commit 84ae514d7e
14 changed files with 128 additions and 57 deletions

doc/how_to/contribute.md Normal file

@@ -0,0 +1,145 @@
Contribute to XGBoost
=====================
XGBoost has been developed and used by a group of active community members.
Everyone is more than welcome to contribute. It is a way to make the project better and more accessible to more users.
- Please add your name to [CONTRIBUTORS.md](../CONTRIBUTORS.md) after your patch has been merged.
- Please also update [NEWS.md](../NEWS.md) to add a note on your changes to the API or on any new documents you added.
Guidelines
----------
* [Submit Pull Request](#submit-pull-request)
* [Git Workflow Howtos](#git-workflow-howtos)
- [How to resolve conflict with master](#how-to-resolve-conflict-with-master)
- [How to combine multiple commits into one](#how-to-combine-multiple-commits-into-one)
- [What is the consequence of force push](#what-is-the-consequence-of-force-push)
* [Documents](#documents)
* [Testcases](#testcases)
* [Examples](#examples)
* [Core Library](#core-library)
* [Python Package](#python-package)
* [R Package](#r-package)
Submit Pull Request
-------------------
* Before submitting, please rebase your code on the most recent version of master. You can do so by
```bash
git remote add upstream https://github.com/dmlc/xgboost
git fetch upstream
git rebase upstream/master
```
* If you have multiple small commits,
it might be good to merge them together (use git rebase then squash) into more meaningful groups.
* Send the pull request!
- Fix the problems reported by automatic checks
- If you are contributing a new module, consider adding a testcase in [tests](../tests)
Git Workflow Howtos
-------------------
### How to resolve conflict with master
- First, rebase to the most recent master
```bash
# The first two steps can be skipped after you do it once.
git remote add upstream https://github.com/dmlc/xgboost
git fetch upstream
git rebase upstream/master
```
- Git may show some conflicts it cannot merge, say ```conflicted.py```.
- Manually modify the file to resolve the conflict.
- After you have resolved the conflict, mark it as resolved by
```bash
git add conflicted.py
```
- Then you can continue the rebase by
```bash
git rebase --continue
```
- Finally, push to your fork; you may need to force push here.
```bash
git push --force
```
### How to combine multiple commits into one
Sometimes we want to combine multiple commits, especially when later commits are only fixes to previous ones,
to create a PR with a set of meaningful commits. You can do this with the following steps.
- Before doing so, configure git's default editor if you haven't done so already.
```bash
git config core.editor the-editor-you-like
```
- Assume we want to merge the last 3 commits; type the following command
```bash
git rebase -i HEAD~3
```
- A text editor will pop up. Set the first commit as ```pick```, and change later ones to ```squash```.
- After you save the file, another text editor will pop up asking you to modify the combined commit message.
- Push the changes to your fork; you need to force push.
```bash
git push --force
```
### What is the consequence of force push
The previous two tips require a force push; this is because we altered the history of the commits.
It is fine to force push to your own fork, as long as the changed commits are only yours.
Documents
---------
* The documentation is created using Sphinx and [recommonmark](http://recommonmark.readthedocs.org/en/latest/)
* You can build the documentation locally to preview the result.
Testcases
---------
* All the testcases are in [tests](../tests)
* We use Python nose for Python test cases; a minimal example sketch is shown below.
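As an illustration only, here is a sketch of what a nose-style test case might look like; the test name and the tiny random dataset are assumptions for the example, not part of the existing test suite.
```python
import numpy as np
import xgboost as xgb

def test_basic_binary_training():
    # Build a tiny random binary classification dataset.
    X = np.random.rand(100, 10)
    y = np.random.randint(2, size=100)
    dtrain = xgb.DMatrix(X, label=y)
    # Train a very small model and check the prediction shape and range.
    params = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}
    bst = xgb.train(params, dtrain, num_boost_round=2)
    preds = bst.predict(dtrain)
    assert preds.shape[0] == 100
    assert ((preds >= 0) & (preds <= 1)).all()
```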
Examples
--------
* Use cases and examples are in [demo](../demo)
* We are super excited to hear about your story. If you have blog posts,
tutorials, or code solutions using xgboost, please tell us and we will add
a link on the example pages.
Core Library
------------
- Follow the Google C++ Style Guide for C++ code.
- We use doxygen to document all the interface code.
- You can reproduce the linter checks by typing ```make lint```
Python Package
--------------
- Always add a docstring in numpydoc format to new functions (see the sketch below).
- You can reproduce the linter checks by typing ```make lint```
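For illustration, here is a sketch of a numpydoc-style docstring on a hypothetical helper function; ```clip_predictions``` is not part of the package, only the docstring layout matters.
```python
import numpy as np

def clip_predictions(preds, lower=0.0, upper=1.0):
    """Clip raw predictions into a closed interval.

    Parameters
    ----------
    preds : numpy.ndarray
        Raw predictions to be clipped.
    lower : float, optional
        Lower bound of the interval.
    upper : float, optional
        Upper bound of the interval.

    Returns
    -------
    numpy.ndarray
        The clipped predictions.
    """
    return np.clip(preds, lower, upper)
```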
R Package
---------
### Code Style
- We follow Google's C++ Style guide on C++ code.
- This is mainly to be consistent with the rest of the project.
- Another reason is we will be able to check style automatically with a linter.
- You can check the style of the code by typing the following command at the root folder.
```bash
make rcpplint
```
- When needed, you can disable the linter warning on a certain line with a ```// NOLINT(*)``` comment.
### Rmarkdown Vignettes
Rmarkdown vignettes are placed in [R-package/vignettes](../R-package/vignettes).
These Rmarkdown files are not compiled. We host the compiled versions on [doc/R-package](R-package).
The following steps are needed to add a new Rmarkdown vignette:
- Add the original rmarkdown to ```R-package/vignettes```
- Modify ```doc/R-package/Makefile``` to add the markdown files to be built
- Clone the [dmlc/web-data](https://github.com/dmlc/web-data) repo to the ```doc``` folder
- Now type the following command in ```doc/R-package```
```bash
make the-markdown-to-make.md
```
- This will generate the markdown, as well as the figures, in ```doc/web-data/xgboost/knitr```
- Modify ```doc/R-package/index.md``` to point to the generated markdown.
- Add the generated figures to the ```dmlc/web-data``` repo.
- If you have already cloned the repo to ```doc```, this means a ```git add```
- Create PRs for both the markdown and ```dmlc/web-data```
- You can also build the documentation locally by typing the following command at ```doc```
```bash
make html
```
The reason we do this is to avoid an exploding repo size due to the generated images.

doc/how_to/external_memory.md Normal file

@@ -0,0 +1,42 @@
Using XGBoost External Memory Version (beta)
============================================
There is no big difference between using the external memory version and the in-memory version.
The only difference is the filename format.
The external memory version takes the following filename format
```
filename#cacheprefix
```
The ```filename``` is the normal path to the libsvm file you want to load, and ```cacheprefix``` is a
path to a cache file that xgboost will use for the external memory cache.
The following code was extracted from [../demo/guide-python/external_memory.py](../demo/guide-python/external_memory.py)
```python
dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache')
```
You can see that there is an additional ```#dtrain.cache``` following the libsvm file; this is the name of the cache file.
For the CLI version, simply use ```"../data/agaricus.txt.train#dtrain.cache"``` as the filename.
Performance Note
----------------
* The parameter ```nthread``` should be set to the number of ***real*** cores
- Most modern CPUs offer hyperthreading, which means you can have a 4-core CPU with 8 threads
- Set ```nthread``` to 4 for maximum performance in such a case (see the sketch below)
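As a quick illustration, here is a minimal sketch, assuming a machine with 4 real cores and the demo libsvm file above, that combines the cache-suffix filename with an explicit ```nthread``` setting:
```python
import xgboost as xgb

# The '#dtrain.cache' suffix tells xgboost to build an external memory cache.
dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache')

params = {
    'objective': 'binary:logistic',
    'max_depth': 2,
    'nthread': 4,  # number of real cores, not hyperthreads
}
bst = xgb.train(params, dtrain, num_boost_round=10)
```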
Distributed Version
-------------------
The external memory mode naturally works in the distributed version; you can simply set the path like
```
data = "hdfs:///path-to-data/#dtrain.cache"
```
xgboost will cache the data to the local disk. When you run on YARN, the current folder is temporary,
so you can directly use ```dtrain.cache``` to cache to the current folder.
Usage Note
----------
* This is an experimental version
- If you would like to try it out, please report your results to https://github.com/dmlc/xgboost/issues/244
* Currently only importing from the libsvm format is supported
- Contributions of ingestion from other common external memory data sources are welcome

doc/how_to/index.md Normal file

@@ -0,0 +1,16 @@
# XGBoost How To
This page contains guidelines for using and developing XGBoost.
## Installation
- [How to Install XGBoost](../build.md)
## Use XGBoost in Specific Ways
- [Parameter tuning guide](param_tuning.md)
- [Use out of core computation for large dataset](external_memory.md)
## Develop and Hack XGBoost
- [Contribute to XGBoost](contribute.md)
## Frequently Asked Questions
- [FAQ](../faq.md)

doc/how_to/param_tuning.md Normal file

@@ -0,0 +1,44 @@
Notes on Parameter Tuning
=========================
Parameter tuning is a dark art in machine learning; the optimal parameters
of a model can depend on many scenarios, so it is impossible to write a
comprehensive guide for doing so.
This document tries to provide some guidelines for the parameters in xgboost.
Understanding Bias-Variance Tradeoff
------------------------------------
If you have taken a machine learning or statistics course, this is likely to be one
of the most important concepts.
When we allow the model to get more complicated (e.g. more depth), the model
has a better ability to fit the training data, resulting in a less biased model.
However, such a complicated model requires more data to fit.
Most of the parameters in xgboost are about the bias-variance tradeoff. The best model
should carefully trade model complexity against its predictive power.
The [Parameters Documentation](parameter.md) will tell you whether each parameter
makes the model more conservative or not. This can be used to help you
turn the knob between a complicated model and a simple model.
Control Overfitting
-------------------
When you observe high training accuracy but low test accuracy, it is likely that you have encountered an overfitting problem.
In general, there are two ways that you can control overfitting in xgboost (see the sketch after this list):
* The first way is to directly control model complexity
- This includes ```max_depth```, ```min_child_weight``` and ```gamma```
* The second way is to add randomness to make training robust to noise
- This includes ```subsample``` and ```colsample_bytree```
- You can also reduce the stepsize ```eta```; remember to increase ```num_round``` when you do so.
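For concreteness, here is a minimal sketch of a parameter dictionary exercising both knobs on a synthetic dataset; the particular values are illustrative assumptions, not recommendations.
```python
import numpy as np
import xgboost as xgb

# Tiny synthetic binary classification dataset, just to make the sketch runnable.
X = np.random.rand(500, 20)
y = np.random.randint(2, size=500)
dtrain = xgb.DMatrix(X, label=y)

params = {
    'objective': 'binary:logistic',
    # 1. Directly control model complexity
    'max_depth': 4,
    'min_child_weight': 2,
    'gamma': 0.1,
    # 2. Add randomness to make training robust to noise
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    # Smaller stepsize; compensate with more boosting rounds
    'eta': 0.1,
}
bst = xgb.train(params, dtrain, num_boost_round=200)
```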
Handle Imbalanced Dataset
-------------------------
For common cases such as ads clickthrough logs, the dataset is extremely imbalanced.
This can affect the training of the xgboost model, and there are two ways to improve it (a sketch follows this list).
* If you care only about the ranking order (AUC) of your prediction
- Balance the positive and negative weights via ```scale_pos_weight```
- Use AUC for evaluation
* If you care about predicting the right probability
- In such a case, you cannot re-balance the dataset
- In such a case, setting the parameter ```max_delta_step``` to a finite number (say 1) will help convergence
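The following sketch, on a synthetic imbalanced dataset with roughly 5% positives (the negative-to-positive weight ratio is an illustrative assumption, not a prescription), shows both options side by side:
```python
import numpy as np
import xgboost as xgb

# Synthetic imbalanced binary dataset: roughly 5% positive labels.
X = np.random.rand(1000, 10)
y = (np.random.rand(1000) < 0.05).astype(int)
dtrain = xgb.DMatrix(X, label=y)

# Option 1: you only care about ranking order (AUC).
# Balance positive and negative weights and evaluate with AUC.
ratio = float(np.sum(y == 0)) / max(float(np.sum(y == 1)), 1.0)
params_rank = {
    'objective': 'binary:logistic',
    'scale_pos_weight': ratio,
    'eval_metric': 'auc',
}
bst_rank = xgb.train(params_rank, dtrain, num_boost_round=50)

# Option 2: you care about predicting the right probability.
# Do not re-balance the dataset; cap the update step to help convergence.
params_prob = {
    'objective': 'binary:logistic',
    'max_delta_step': 1,
}
bst_prob = xgb.train(params_prob, dtrain, num_boost_round=50)
```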