[DOC] refactor doc

tqchen
2016-05-20 13:09:42 -07:00
parent 149589c583
commit 84ae514d7e
14 changed files with 128 additions and 57 deletions

doc/how_to/contribute.md Normal file

@@ -0,0 +1,145 @@
Contribute to XGBoost
=====================
XGBoost has been developed and used by a group of active community members.
Everyone is more than welcome to contribute. It is a way to make the project better and more accessible to more users.
- Please add your name to [CONTRIBUTORS.md](../CONTRIBUTORS.md) after your patch has been merged.
- Please also update [NEWS.md](../NEWS.md) to add a note on your changes to the API or on any new documents you added.
Guidelines
----------
* [Submit Pull Request](#submit-pull-request)
* [Git Workflow Howtos](#git-workflow-howtos)
- [How to resolve conflict with master](#how-to-resolve-conflict-with-master)
- [How to combine multiple commits into one](#how-to-combine-multiple-commits-into-one)
- [What is the consequence of force push](#what-is-the-consequence-of-force-push)
* [Documents](#documents)
* [Testcases](#testcases)
* [Examples](#examples)
* [Core Library](#core-library)
* [Python Package](#python-package)
* [R Package](#r-package)
Submit Pull Request
-------------------
* Before submitting, please rebase your code on the most recent version of master. You can do so by
```bash
git remote add upstream https://github.com/dmlc/xgboost
git fetch upstream
git rebase upstream/master
```
* If you have multiple small commits,
it might be good to merge them together (use git rebase then squash) into more meaningful groups.
* Send the pull request!
- Fix the problems reported by automatic checks
- If you are contributing a new module, consider adding a testcase in [tests](../tests)
Git Workflow Howtos
-------------------
### How to resolve conflict with master
- First, rebase to the most recent master
```bash
# The first two steps can be skipped after you do it once.
git remote add upstream https://github.com/dmlc/xgboost
git fetch upstream
git rebase upstream/master
```
- Git may show some conflicts it cannot merge, say ```conflicted.py```.
- Manually modify the file to resolve the conflict.
- After you have resolved the conflict, mark it as resolved by
```bash
git add conflicted.py
```
- Then you can continue the rebase by
```bash
git rebase --continue
```
- Finally, push to your fork; you may need to force push here.
```bash
git push --force
```
### How to combine multiple commits into one
Sometimes we want to combine multiple commits, especially when later commits are only fixes to previous ones,
to create a PR with a set of meaningful commits. You can do this with the following steps.
- Before doing so, configure git's default editor if you haven't done so already.
```bash
git config core.editor the-editor-you-like
```
- Assume we want to merge the last 3 commits; type the following command
```bash
git rebase -i HEAD~3
```
- A text editor will pop up. Set the first commit as ```pick```, and change later ones to ```squash```.
- After you save the file, another text editor will pop up asking you to modify the combined commit message.
- Push the changes to your fork; you need to force push.
```bash
git push --force
```
### What is the consequence of force push
The previous two tips require a force push; this is because we altered the history of the commits.
It is fine to force push to your own fork, as long as the changed commits are only yours.
Documents
---------
* The documentation is created using Sphinx and [recommonmark](http://recommonmark.readthedocs.org/en/latest/)
* You can build the documentation locally to preview the result.
Testcases
---------
* All the testcases are in [tests](../tests)
* We use Python nose for Python test cases; a minimal example sketch is shown below.
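As an illustration only, here is a sketch of what a nose-style test case might look like; the test name and the tiny random dataset are assumptions for the example, not part of the existing test suite.
```python
import numpy as np
import xgboost as xgb

def test_basic_binary_training():
    # Build a tiny random binary classification dataset.
    X = np.random.rand(100, 10)
    y = np.random.randint(2, size=100)
    dtrain = xgb.DMatrix(X, label=y)
    # Train a very small model and check the prediction shape and range.
    params = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}
    bst = xgb.train(params, dtrain, num_boost_round=2)
    preds = bst.predict(dtrain)
    assert preds.shape[0] == 100
    assert ((preds >= 0) & (preds <= 1)).all()
```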
Examples
--------
* Use cases and examples are in [demo](../demo)
* We are super excited to hear about your story. If you have blog posts,
tutorials, or code solutions using xgboost, please tell us and we will add
a link on the example pages.
Core Library
------------
- Follow the Google C++ Style Guide for C++ code.
- We use doxygen to document all the interface code.
- You can reproduce the linter checks by typing ```make lint```
Python Package
--------------
- Always add a docstring in numpydoc format to new functions (see the sketch below).
- You can reproduce the linter checks by typing ```make lint```
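For illustration, here is a sketch of a numpydoc-style docstring on a hypothetical helper function; ```clip_predictions``` is not part of the package, only the docstring layout matters.
```python
import numpy as np

def clip_predictions(preds, lower=0.0, upper=1.0):
    """Clip raw predictions into a closed interval.

    Parameters
    ----------
    preds : numpy.ndarray
        Raw predictions to be clipped.
    lower : float, optional
        Lower bound of the interval.
    upper : float, optional
        Upper bound of the interval.

    Returns
    -------
    numpy.ndarray
        The clipped predictions.
    """
    return np.clip(preds, lower, upper)
```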
R Package
---------
### Code Style
- We follow Google's C++ Style guide on C++ code.
- This is mainly to be consistent with the rest of the project.
- Another reason is we will be able to check style automatically with a linter.
- You can check the style of the code by typing the following command at the root folder.
```bash
make rcpplint
```
- When needed, you can disable the linter warning on a certain line with a ```// NOLINT(*)``` comment.
### Rmarkdown Vignettes
Rmarkdown vignettes are placed in [R-package/vignettes](../R-package/vignettes).
These Rmarkdown files are not compiled. We host the compiled versions on [doc/R-package](R-package).
The following steps are needed to add a new Rmarkdown vignette:
- Add the original rmarkdown to ```R-package/vignettes```
- Modify ```doc/R-package/Makefile``` to add the markdown files to be built
- Clone the [dmlc/web-data](https://github.com/dmlc/web-data) repo to the ```doc``` folder
- Now type the following command in ```doc/R-package```
```bash
make the-markdown-to-make.md
```
- This will generate the markdown, as well as the figures, in ```doc/web-data/xgboost/knitr```
- Modify ```doc/R-package/index.md``` to point to the generated markdown.
- Add the generated figures to the ```dmlc/web-data``` repo.
- If you have already cloned the repo to ```doc```, this means a ```git add```
- Create PRs for both the markdown and ```dmlc/web-data```
- You can also build the documentation locally by typing the following command at ```doc```
```bash
make html
```
The reason we do this is to avoid an exploding repo size due to the generated images.

doc/how_to/external_memory.md Normal file

@@ -0,0 +1,42 @@
Using XGBoost External Memory Version (beta)
============================================
There is no big difference between using the external memory version and the in-memory version.
The only difference is the filename format.
The external memory version takes the following filename format
```
filename#cacheprefix
```
The ```filename``` is the normal path to the libsvm file you want to load, and ```cacheprefix``` is a
path to a cache file that xgboost will use for the external memory cache.
The following code was extracted from [../demo/guide-python/external_memory.py](../demo/guide-python/external_memory.py)
```python
dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache')
```
You can see that there is an additional ```#dtrain.cache``` following the libsvm file; this is the name of the cache file.
For the CLI version, simply use ```"../data/agaricus.txt.train#dtrain.cache"``` as the filename.
Performance Note
----------------
* The parameter ```nthread``` should be set to the number of ***real*** cores
- Most modern CPUs offer hyperthreading, which means you can have a 4-core CPU with 8 threads
- Set ```nthread``` to 4 for maximum performance in such a case (see the sketch below)
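As a quick illustration, here is a minimal sketch, assuming a machine with 4 real cores and the demo libsvm file above, that combines the cache-suffix filename with an explicit ```nthread``` setting:
```python
import xgboost as xgb

# The '#dtrain.cache' suffix tells xgboost to build an external memory cache.
dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache')

params = {
    'objective': 'binary:logistic',
    'max_depth': 2,
    'nthread': 4,  # number of real cores, not hyperthreads
}
bst = xgb.train(params, dtrain, num_boost_round=10)
```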
Distributed Version
-------------------
The external memory mode naturally works in the distributed version; you can simply set the path like
```
data = "hdfs:///path-to-data/#dtrain.cache"
```
xgboost will cache the data to the local disk. When you run on YARN, the current folder is temporary,
so you can directly use ```dtrain.cache``` to cache to the current folder.
Usage Note
----------
* This is an experimental version
- If you would like to try it out, please report your results to https://github.com/dmlc/xgboost/issues/244
* Currently only importing from the libsvm format is supported
- Contributions of ingestion from other common external memory data sources are welcome

doc/how_to/index.md Normal file

@@ -0,0 +1,16 @@
# XGBoost How To
This page contains guidelines for using and developing XGBoost.
## Installation
- [How to Install XGBoost](../build.md)
## Use XGBoost in Specific Ways
- [Parameter tuning guide](param_tuning.md)
- [Use out of core computation for large dataset](external_memory.md)
## Develop and Hack XGBoost
- [Contribute to XGBoost](contribute.md)
## Frequently Asked Questions
- [FAQ](../faq.md)

doc/how_to/param_tuning.md Normal file

@@ -0,0 +1,44 @@
Notes on Parameter Tuning
=========================
Parameter tuning is a dark art in machine learning; the optimal parameters
of a model can depend on many scenarios, so it is impossible to write a
comprehensive guide for doing so.
This document tries to provide some guidelines for the parameters in xgboost.
Understanding Bias-Variance Tradeoff
------------------------------------
If you have taken a machine learning or statistics course, this is likely to be one
of the most important concepts.
When we allow the model to get more complicated (e.g. more depth), the model
has a better ability to fit the training data, resulting in a less biased model.
However, such a complicated model requires more data to fit.
Most of the parameters in xgboost are about the bias-variance tradeoff. The best model
should carefully trade model complexity against its predictive power.
The [Parameters Documentation](parameter.md) will tell you whether each parameter
makes the model more conservative or not. This can be used to help you
turn the knob between a complicated model and a simple model.
Control Overfitting
-------------------
When you observe high training accuracy but low test accuracy, it is likely that you have encountered an overfitting problem.
In general, there are two ways that you can control overfitting in xgboost (see the sketch after this list):
* The first way is to directly control model complexity
- This includes ```max_depth```, ```min_child_weight``` and ```gamma```
* The second way is to add randomness to make training robust to noise
- This includes ```subsample``` and ```colsample_bytree```
- You can also reduce the stepsize ```eta```; remember to increase ```num_round``` when you do so.
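For concreteness, here is a minimal sketch of a parameter dictionary exercising both knobs on a synthetic dataset; the particular values are illustrative assumptions, not recommendations.
```python
import numpy as np
import xgboost as xgb

# Tiny synthetic binary classification dataset, just to make the sketch runnable.
X = np.random.rand(500, 20)
y = np.random.randint(2, size=500)
dtrain = xgb.DMatrix(X, label=y)

params = {
    'objective': 'binary:logistic',
    # 1. Directly control model complexity
    'max_depth': 4,
    'min_child_weight': 2,
    'gamma': 0.1,
    # 2. Add randomness to make training robust to noise
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    # Smaller stepsize; compensate with more boosting rounds
    'eta': 0.1,
}
bst = xgb.train(params, dtrain, num_boost_round=200)
```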
Handle Imbalanced Dataset
-------------------------
For common cases such as ads clickthrough logs, the dataset is extremely imbalanced.
This can affect the training of the xgboost model, and there are two ways to improve it (a sketch follows this list).
* If you care only about the ranking order (AUC) of your prediction
- Balance the positive and negative weights via ```scale_pos_weight```
- Use AUC for evaluation
* If you care about predicting the right probability
- In such a case, you cannot re-balance the dataset
- In such a case, setting the parameter ```max_delta_step``` to a finite number (say 1) will help convergence
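The following sketch, on a synthetic imbalanced dataset with roughly 5% positives (the negative-to-positive weight ratio is an illustrative assumption, not a prescription), shows both options side by side:
```python
import numpy as np
import xgboost as xgb

# Synthetic imbalanced binary dataset: roughly 5% positive labels.
X = np.random.rand(1000, 10)
y = (np.random.rand(1000) < 0.05).astype(int)
dtrain = xgb.DMatrix(X, label=y)

# Option 1: you only care about ranking order (AUC).
# Balance positive and negative weights and evaluate with AUC.
ratio = float(np.sum(y == 0)) / max(float(np.sum(y == 1)), 1.0)
params_rank = {
    'objective': 'binary:logistic',
    'scale_pos_weight': ratio,
    'eval_metric': 'auc',
}
bst_rank = xgb.train(params_rank, dtrain, num_boost_round=50)

# Option 2: you care about predicting the right probability.
# Do not re-balance the dataset; cap the update step to help convergence.
params_prob = {
    'objective': 'binary:logistic',
    'max_delta_step': 1,
}
bst_prob = xgb.train(params_prob, dtrain, num_boost_round=50)
```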