Doc modernization (#3474)
* Change doc build to reST exclusively
* Rewrite Intro doc in reST; create toctree
* Update parameter and contribute
* Convert tutorials to reST
* Convert Python tutorials to reST
* Convert CLI and Julia docs to reST
* Enable markdown for R vignettes
* Done migrating to reST
* Add guzzle_sphinx_theme to requirements
* Add breathe to requirements
* Fix search bar
* Add link to user forum
Committed by Philip Cho
parent d0f45bede0
commit e19dded9a3
@@ -1,164 +0,0 @@
Contribute to XGBoost
=====================
XGBoost has been developed and used by a group of active community members.
Everyone is more than welcome to contribute. It is a way to make the project better and more accessible to more users.

- Please add your name to [CONTRIBUTORS.md](../../CONTRIBUTORS.md) after your patch has been merged.
- Please also update [NEWS.md](../../NEWS.md) with a note when your change affects the API or adds a new document.

Guidelines
----------
* [Submit Pull Request](#submit-pull-request)
* [Git Workflow Howtos](#git-workflow-howtos)
  - [How to resolve conflict with master](#how-to-resolve-conflict-with-master)
  - [How to combine multiple commits into one](#how-to-combine-multiple-commits-into-one)
  - [What is the consequence of force push](#what-is-the-consequence-of-force-push)
* [Documents](#documents)
* [Testcases](#testcases)
* [Examples](#examples)
* [Core Library](#core-library)
* [Python Package](#python-package)
* [R Package](#r-package)

Submit Pull Request
-------------------
* Before submitting, please rebase your code on the most recent version of master. You can do it by
```bash
git remote add upstream https://github.com/dmlc/xgboost
git fetch upstream
git rebase upstream/master
```
* If you have multiple small commits,
  it might be good to merge them together (use git rebase, then squash) into more meaningful groups.
* Send the pull request!
  - Fix the problems reported by automatic checks
  - If you are contributing a new module, consider adding a testcase in [tests](../tests)

Git Workflow Howtos
-------------------
### How to resolve conflict with master
- First rebase to the most recent master
```bash
# The first two steps can be skipped after you do it once.
git remote add upstream https://github.com/dmlc/xgboost
git fetch upstream
git rebase upstream/master
```
- Git may report conflicts it cannot merge automatically, say in ```conflicted.py```.
  - Manually modify the file to resolve the conflict.
  - After you have resolved the conflict, mark it as resolved by
```bash
git add conflicted.py
```
- Then you can continue the rebase by
```bash
git rebase --continue
```
- Finally, push to your fork. You may need to force push here.
```bash
git push --force
```

### How to combine multiple commits into one
Sometimes we want to combine multiple commits, especially when later commits are only fixes to previous ones,
to create a PR with a set of meaningful commits. You can do it with the following steps.
- Before doing so, configure the default editor of git if you haven't done so before.
```bash
git config core.editor the-editor-you-like
```
- Assuming we want to merge the last 3 commits, type the following command
```bash
git rebase -i HEAD~3
```
- It will pop up a text editor. Set the first commit as ```pick```, and change the later ones to ```squash```.
- After you save the file, it will pop up another text editor asking you to modify the combined commit message.
- Push the changes to your fork. You need to force push.
```bash
git push --force
```

### What is the consequence of force push
The previous two tips require a force push because they rewrite the commit history of the branch.
It is fine to force push to your own fork, as long as the commits changed are only yours.

Documents
---------
* The documentation is created using sphinx and [recommonmark](http://recommonmark.readthedocs.org/en/latest/)
* You can build the documentation locally to see the effect of your changes.

Testcases
---------
* All the testcases are in [tests](../tests)
* We use python nose for python test cases.

Examples
--------
* Use cases and examples will be in [demo](../demo)
* We are super excited to hear about your story. If you have blog posts,
  tutorials, or code solutions using xgboost, please tell us and we will add
  a link in the example pages.

Core Library
------------
- Follow the Google C++ style guide for C++ code.
- We use doxygen to document all the interface code.
- You can reproduce the linter checks by typing ```make lint```

Python Package
--------------
- Always add a docstring to new functions, in numpydoc format.
- You can reproduce the linter checks by typing ```make lint```

R Package
---------
### Code Style
- We follow Google's C++ Style guide for C++ code.
  - This is mainly to be consistent with the rest of the project.
  - Another reason is that we are able to check style automatically with a linter.
- You can check the style of the code by typing the following command in the root folder.
```bash
make rcpplint
```
- When needed, you can disable the linter warning for a certain line with ```// NOLINT(*)``` comments.
- We use [roxygen](https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html) for documenting the R package.

### Rmarkdown Vignettes
Rmarkdown vignettes are placed in [R-package/vignettes](../R-package/vignettes).
These Rmarkdown files are not compiled. We host the compiled version on [doc/R-package](R-package).

Follow these steps to add a new Rmarkdown vignette:
- Add the original rmarkdown to ```R-package/vignettes```
- Modify ```doc/R-package/Makefile``` to add the markdown files to be built
- Clone the [dmlc/web-data](https://github.com/dmlc/web-data) repo to folder ```doc```
- Now type the following command in ```doc/R-package```
```bash
make the-markdown-to-make.md
```
- This will generate the markdown, as well as the figures, into ```doc/web-data/xgboost/knitr```
- Modify ```doc/R-package/index.md``` to point to the generated markdown.
- Add the generated figures to the ```dmlc/web-data``` repo.
  - If you already cloned the repo to doc, this means a ```git add```
- Create a PR for both the markdown and ```dmlc/web-data```
- You can also build the document locally by typing the following command in ```doc```
```bash
make html
```
The reason we do this is to keep the repo size from exploding due to the generated images.

### R package versioning
Since version 0.6.4.3, we have adopted a versioning system that uses an ```x.y.z``` (or ```core_major.core_minor.cran_release```)
format for CRAN releases and an ```x.y.z.p``` (or ```core_major.core_minor.cran_release.patch```) format for development patch versions.
This approach is similar to the one described in Yihui Xie's
[blog post on R Package Versioning](https://yihui.name/en/2013/06/r-package-versioning/),
except we need an additional field to accommodate the ```x.y``` core library version.

Each new CRAN release bumps up the 3rd field, while developments in-between CRAN releases
are marked by an additional 4th field on top of an existing CRAN release version.
Some additional consideration is needed when the core library version changes.
E.g., after the core changes from 0.6 to 0.7, the R package development version would become 0.7.0.1, working towards
a 0.7.1 CRAN release. The 0.7.0 would not be released to CRAN, unless it would require almost no additional development.
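
As a rough illustration of the scheme above, the following Python sketch classifies a version string by its number of fields. The function names `classify_r_version` and `core_version` are hypothetical helpers written for this example; they are not part of the XGBoost code base.

```python
def classify_r_version(version: str) -> str:
    """Classify a version string under the x.y.z / x.y.z.p scheme.

    Hypothetical helper for illustration only.
    """
    fields = version.split(".")
    if not all(field.isdigit() for field in fields):
        raise ValueError(f"non-numeric field in version: {version!r}")
    if len(fields) == 3:
        # core_major.core_minor.cran_release
        return "cran_release"
    if len(fields) == 4:
        # core_major.core_minor.cran_release.patch
        return "development_patch"
    raise ValueError(f"expected 3 or 4 fields, got {len(fields)}: {version!r}")


def core_version(version: str) -> str:
    """Return the x.y core library part of a valid version string."""
    classify_r_version(version)  # validates the format first
    major, minor = version.split(".")[:2]
    return f"{major}.{minor}"


print(classify_r_version("0.7.0.1"))  # development version after the core moved to 0.7
print(classify_r_version("0.7.1"))    # the CRAN release it works towards
print(core_version("0.7.0.1"))
```

The first two fields always track the core library, so a check like `core_version` could be used to confirm that a development version and its target CRAN release belong to the same core series.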

### Registering native routines in R
According to the [R extension manual](https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Registering-native-routines),
it is good practice to register native routines and to disable symbol search. When any changes or additions are made to the
C++ interface of the R package, please make corresponding changes in ```src/init.c``` as well.
@@ -1,42 +0,0 @@
Using XGBoost External Memory Version(beta)
===========================================
There is no big difference between using the external memory version and the in-memory version.
The only difference is the filename format.

The external memory version takes the following filename format
```
filename#cacheprefix
```

The ```filename``` is the normal path to the libsvm file you want to load, and ```cacheprefix``` is a
path to a cache file that xgboost will use for the external memory cache.

The following code was extracted from [../../demo/guide-python/external_memory.py](../../demo/guide-python/external_memory.py)
```python
dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache')
```
Note the additional ```#dtrain.cache``` following the libsvm file; this is the name of the cache file.
For the CLI version, simply use ```"../data/agaricus.txt.train#dtrain.cache"``` as the filename.
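
A minimal sketch of how such a path could be assembled in Python is shown below. The helper `external_memory_uri` is a hypothetical name introduced for this example and is not an XGBoost function; it only joins the two parts with ```#```, as described above.

```python
def external_memory_uri(data_path: str, cache_prefix: str) -> str:
    """Join a libsvm file path and a cache prefix as filename#cacheprefix.

    Hypothetical helper for illustration only.
    """
    if not cache_prefix:
        raise ValueError("cache_prefix must not be empty")
    if "#" in cache_prefix:
        raise ValueError("cache_prefix must not contain '#'")
    return f"{data_path}#{cache_prefix}"


uri = external_memory_uri("../data/agaricus.txt.train", "dtrain.cache")
print(uri)  # ../data/agaricus.txt.train#dtrain.cache

# The resulting string is what the demo above passes to xgb.DMatrix(...).
```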

Performance Note
----------------
* The parameter ```nthread``` should be set to the number of ***real*** cores
  - Most modern CPUs offer hyperthreading, which means you can have a 4 core CPU with 8 threads
  - Set ```nthread``` to 4 for maximum performance in such a case
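
The rule of thumb above can be sketched as follows. Python's `os.cpu_count()` reports logical CPUs, so this example divides by an assumed number of hardware threads per core; the default of 2 and the helper name `physical_core_estimate` are assumptions made for illustration, and the estimate will be wrong on machines without hyperthreading.

```python
import os


def physical_core_estimate(logical_cpus: int, threads_per_core: int = 2) -> int:
    """Estimate the number of real cores from the logical CPU count.

    threads_per_core=2 is an assumption (typical hyperthreading), not
    something detected from the hardware.
    """
    return max(1, logical_cpus // threads_per_core)


logical = os.cpu_count() or 1  # cpu_count() may return None
params = {"nthread": physical_core_estimate(logical)}
print(params)

# A 4 core CPU with 8 threads, as in the note above:
print(physical_core_estimate(8))  # 4
```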

Distributed Version
-------------------
The external memory mode naturally works on the distributed version. You can simply set the path like
```
data = "hdfs://path-to-data/#dtrain.cache"
```
xgboost will cache the data locally. When you run on YARN, the current folder is temporary,
so you can directly use ```dtrain.cache``` to cache to the current folder.


Usage Note
----------
* This is an experimental version
  - If you would like to try and test it, report results to https://github.com/dmlc/xgboost/issues/244
* Currently only importing from libsvm format is supported
  - Contributions of ingestion from other common external memory data sources are welcome
@@ -1,44 +0,0 @@
Notes on Parameter Tuning
=========================
Parameter tuning is a dark art in machine learning. The optimal parameters
of a model can depend on many scenarios, so it is impossible to create a
comprehensive guide for doing so.

This document tries to provide some guidelines for parameters in xgboost.


Understanding Bias-Variance Tradeoff
------------------------------------
If you take a machine learning or statistics course, this is likely to be one
of the most important concepts.
When we allow the model to get more complicated (e.g. more depth), the model
has better ability to fit the training data, resulting in a less biased model.
However, such a complicated model requires more data to fit.

Most of the parameters in xgboost are about the bias-variance tradeoff. The best model
should trade the model complexity with its predictive power carefully.
[Parameters Documentation](../parameter.md) will tell you whether each parameter
will make the model more conservative or not. This can be used to help you
turn the knob between a complicated model and a simple model.

Control Overfitting
-------------------
When you observe high training accuracy but low test accuracy, it is likely that you have encountered an overfitting problem.

There are in general two ways that you can control overfitting in xgboost
* The first way is to directly control model complexity
  - This includes ```max_depth```, ```min_child_weight``` and ```gamma```
* The second way is to add randomness to make training robust to noise
  - This includes ```subsample``` and ```colsample_bytree```
  - You can also reduce the stepsize ```eta```, but remember to increase ```num_round``` when you do so.
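
The two approaches above could be combined as in the following sketch, which takes a parameter dictionary and returns a more conservative variant. The helper name `more_conservative`, the starting values, and the size of each adjustment are assumptions chosen for illustration, not recommended settings.

```python
def more_conservative(params: dict, num_round: int):
    """Return (params, num_round) nudged towards a simpler, more regularized model.

    Hypothetical helper; the adjustment sizes are illustrative only.
    """
    tuned = dict(params)
    # 1. Directly control model complexity.
    tuned["max_depth"] = max(1, params["max_depth"] - 2)
    tuned["min_child_weight"] = params["min_child_weight"] * 2
    tuned["gamma"] = params["gamma"] + 1.0
    # 2. Add randomness to make training robust to noise.
    tuned["subsample"] = min(params["subsample"], 0.8)
    tuned["colsample_bytree"] = min(params["colsample_bytree"], 0.8)
    # Smaller step size, compensated by more boosting rounds.
    tuned["eta"] = params["eta"] / 2
    return tuned, num_round * 2


base = {
    "max_depth": 6,
    "min_child_weight": 1,
    "gamma": 0.0,
    "subsample": 1.0,
    "colsample_bytree": 1.0,
    "eta": 0.3,
}
tuned, rounds = more_conservative(base, num_round=100)
print(tuned, rounds)
```

Halving ```eta``` while doubling ```num_round``` keeps the pairing described in the last bullet; whether a factor of two is appropriate depends on the dataset and should be checked on held-out data.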

Handle Imbalanced Dataset
-------------------------
For common cases such as ads clickthrough logs, the dataset is extremely imbalanced.
This can affect the training of an xgboost model, and there are two ways to improve it.
* If you care only about the ranking order (AUC) of your prediction
  - Balance the positive and negative weights via ```scale_pos_weight```
  - Use AUC for evaluation
* If you care about predicting the right probability
  - In such a case, you cannot re-balance the dataset
  - In such a case, setting the parameter ```max_delta_step``` to a finite number (say 1) will help convergence
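
The two cases above might translate into parameter dictionaries as sketched below. Setting ```scale_pos_weight``` to the ratio of negative to positive examples is a common starting heuristic and is an assumption of this example, as are the helper names and the choice of ```binary:logistic``` as the objective.

```python
def ranking_params(labels):
    """Parameters for the case where only ranking order (AUC) matters.

    Assumes binary labels in {0, 1}; hypothetical helper for illustration.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("need at least one positive example")
    return {
        "objective": "binary:logistic",
        "eval_metric": "auc",
        # Heuristic: weight positives by the negative/positive ratio.
        "scale_pos_weight": negatives / positives,
    }


def probability_params():
    """Parameters for the case where calibrated probabilities matter.

    The data is not re-balanced; max_delta_step is capped instead.
    """
    return {"objective": "binary:logistic", "max_delta_step": 1}


# Toy label vector: 10 clicks out of 1000 impressions.
labels = [1] * 10 + [0] * 990
print(ranking_params(labels))
print(probability_params())
```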