Merge pull request #876 from tqchen/master

[DOC] reorg docs
Tianqi Chen 2016-02-25 14:08:48 -08:00
commit 1176f9ac1b
5 changed files with 73 additions and 62 deletions

View File: README.md

@@ -10,7 +10,7 @@
 [Documentation](https://xgboost.readthedocs.org) |
 [Resources](demo/README.md) |
 [Installation](https://xgboost.readthedocs.org/en/latest/build.html) |
-[Release Notes](NEWS.md)|
+[Release Notes](NEWS.md) |
 [RoadMap](https://github.com/dmlc/xgboost/issues/873)

 XGBoost is an optimized distributed gradient boosting library designed to be highly ***efficient***, ***flexible*** and ***portable***.
@@ -22,17 +22,19 @@ What's New
 ----------
 * [XGBoost brick](NEWS.md) Release

 Ask a Question
 --------------
 * For reporting bugs please use the [xgboost/issues](https://github.com/dmlc/xgboost/issues) page.
 * For generic questions or to share your experience using xgboost, please use the [XGBoost User Group](https://groups.google.com/forum/#!forum/xgboost-user/)

-Contributing to XGBoost
------------------------
-XGBoost has been developed and used by a group of active community members. Everyone is more than welcome to contribute. It is a way to make the project better and more accessible to more users.
-* Check out [call for contributions](https://github.com/dmlc/xgboost/issues?q=is%3Aissue+is%3Aclosed+label%3Acall-for-contribution) and [Roadmap](https://github.com/dmlc/xgboost/issues/873) to see what can be improved, or open an issue if you want something.
-* Contribute to the [documents and examples](https://github.com/dmlc/xgboost/blob/master/doc/) to share your experience with other users.
-* Please add your name to [CONTRIBUTORS.md](CONTRIBUTORS.md) after your patch has been merged.
+Help to Make XGBoost Better
+---------------------------
+XGBoost has been developed and used by a group of active community members. Your help is very valuable to make the package better for everyone.
+- Check out [call for contributions](https://github.com/dmlc/xgboost/issues?q=is%3Aissue+is%3Aclosed+label%3Acall-for-contribution) and [Roadmap](https://github.com/dmlc/xgboost/issues/873) to see what can be improved, or open an issue if you want something.
+- Contribute to the [documents and examples](https://github.com/dmlc/xgboost/blob/master/doc/) to share your experience with other users.
+- Add your stories and experience to [Awesome XGBoost](demo/README.md).
+- Please add your name to [CONTRIBUTORS.md](CONTRIBUTORS.md) after your patch has been merged.
 - Please also update [NEWS.md](NEWS.md) on changes and improvements in API and docs.

 License

View File: demo/README.md

@@ -14,8 +14,8 @@ Contents
 - [Benchmarks](#benchmarks)
 - [Machine Learning Challenge Winning Solutions](#machine-learning-challenge-winning-solutions)
 - [Tutorials](#tutorials)
-- [Usecases](#usecases)
 - [Tools using XGBoost](#tools-using-xgboost)
+- [Services Powered by XGBoost](#services-powered-by-xgboost)
 - [Awards](#awards)

 Code Examples
@@ -101,15 +101,20 @@ Please send pull requests if you find ones that are missing here.
 - [Notes on eXtreme Gradient Boosting](http://startup.ml/blog/xgboost) by ARSHAK NAVRUZYAN ([iPython Notebook](https://github.com/startupml/koan/blob/master/eXtreme%20Gradient%20Boosting.ipynb))

-## Usecases
-If you have a particular use case of xgboost that you would like to highlight,
-send a PR to add a one sentence description :)
-- XGBoost is used in [Kaggle Script](https://www.kaggle.com/scripts) to solve data science challenges.
-- [Seldon predictive service powered by XGBoost](http://docs.seldon.io/iris-demo.html)
-- XGBoost Distributed is used in [ODPS Cloud Service by Alibaba](https://yq.aliyun.com/articles/6355) (in Chinese)
-- XGBoost is incorporated as part of [Graphlab Create](https://dato.com/products/create/) for scalable machine learning.

 ## Tools using XGBoost
 - [BayesBoost](https://github.com/mpearmain/BayesBoost) - Bayesian Optimization using xgboost and sklearn API

+## Services Powered by XGBoost
+- [Seldon predictive service powered by XGBoost](http://docs.seldon.io/iris-demo.html)
+- [ODPS by Alibaba](https://yq.aliyun.com/articles/6355) (in Chinese)
+
 ## Awards
 - [John Chambers Award](http://stat-computing.org/awards/jmc/winners.html) - 2016 Winner: XGBoost R Package, by Tong He (Simon Fraser University) and Tianqi Chen (University of Washington)

View File: demo/binary_classification/README.md

@@ -1,15 +1,15 @@
 Binary Classification
-====
-This is the quick start tutorial for the xgboost CLI version. You can also check out [../../doc/README.md](../../doc/README.md) for links to tutorials in Python or R.
+=====================
+This is the quick start tutorial for the xgboost CLI version.
 Here we demonstrate how to use XGBoost for a binary classification task. Before getting started, make sure you compile xgboost in the root directory of the project by typing ```make```.
 The script runexp.sh can be used to run the demo. Here we use the [mushroom dataset](https://archive.ics.uci.edu/ml/datasets/Mushroom) from the UCI machine learning repository.

 ### Tutorial
 #### Generate Input Data
 XGBoost takes LibSVM format. An example of fake input data is below:
 ```
 1 101:1.2 102:0.03
 0 1:2.1 10001:300 10002:400
 ...
 ```
 Each line represents a single instance. In the first line, '1' is the instance label, '101' and '102' are feature indices, and '1.2' and '0.03' are the corresponding feature values. In the binary classification case, '1' indicates positive samples and '0' indicates negative samples. We also support probability values in [0,1] as labels, to indicate the probability of the instance being positive.
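
For illustration, a minimal sketch of loading these LibSVM-format files with the xgboost Python package (file names taken from this demo; the sketch itself is not part of the CLI walkthrough):

```python
import xgboost as xgb

# Load the LibSVM-format files produced by mknfold.py (see below);
# paths assume the demo/binary_classification directory.
dtrain = xgb.DMatrix('agaricus.txt.train')
dtest = xgb.DMatrix('agaricus.txt.test')
print(dtrain.num_row(), dtrain.num_col())
```
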
@@ -22,7 +22,7 @@ python mknfold.py agaricus.txt 1
 ```
 The two files, 'agaricus.txt.train' and 'agaricus.txt.test', will be used as the training set and test set.

 #### Training
 Then we can run the training process:
 ```
 ../../xgboost mushroom.conf
@@ -33,31 +33,31 @@ mushroom.conf is the configuration for both training and testing. Each line cont
 ```conf
 # General Parameters, see comment for each definition
 # can be gbtree or gblinear
 booster = gbtree
 # choose logistic regression loss function for binary classification
 objective = binary:logistic

 # Tree Booster Parameters
 # step size shrinkage
 eta = 1.0
 # minimum loss reduction required to make a further partition
 gamma = 1.0
 # minimum sum of instance weight (hessian) needed in a child
 min_child_weight = 1
 # maximum depth of a tree
 max_depth = 3

 # Task Parameters
 # the number of rounds of boosting
 num_round = 2
 # 0 means do not save any model except the final round model
 save_period = 0
 # The path of training data
 data = "agaricus.txt.train"
 # The path of validation data, used to monitor the training process; here [test] sets the name of the validation set
 eval[test] = "agaricus.txt.test"
 # The path of test data
 test:data = "agaricus.txt.test"
 ```
 We use the tree booster and logistic regression objective in our setting. This indicates that we accomplish our task using classic gradient boosted regression trees (GBRT), which is a promising method for binary classification.
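
For comparison, a minimal sketch of the same setup through the xgboost Python package (assuming the agaricus files generated above; parameter names mirror mushroom.conf):

```python
import xgboost as xgb

dtrain = xgb.DMatrix('agaricus.txt.train')
dtest = xgb.DMatrix('agaricus.txt.test')

# Same settings as mushroom.conf above.
params = {
    'booster': 'gbtree',
    'objective': 'binary:logistic',
    'eta': 1.0,
    'gamma': 1.0,
    'min_child_weight': 1,
    'max_depth': 3,
}
# evals plays the role of eval[test]: validation error is printed each round.
bst = xgb.train(params, dtrain, num_boost_round=2, evals=[(dtest, 'test')])
bst.save_model('0002.model')
```
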
@@ -70,7 +70,7 @@ If you are interested in more parameter settings, the complete parameter setting
 This means that the parameter max_depth will be set to 6 rather than the 3 given in the conf file. When you use the command line, make sure max_depth=6 is passed in as a single argument, i.e., it must not contain spaces. When a parameter setting is provided both on the command line and in the config file, the command line setting will override the setting in the config file.

 In this example, we use the tree booster for gradient boosting. If you would like to use the linear booster for regression, you can keep all the parameters except booster and the tree booster parameters, as below:
 ```conf
 # General Parameters
 # choose the linear booster
 booster = gblinear
@@ -86,15 +86,15 @@ f ```agaricus.txt.test.buffer``` exists, and automatically loads from binary buf
 - xgboost allows feature indices to start from 0
 - for binary classification, the label is 1 for positive and 0 for negative, instead of +1/-1
 - the feature indices in each line *do not* need to be sorted
 alpha = 0.01
 # L2 regularization term on bias, default 0
 lambda_bias = 0.01
 # Regression Parameters
 ...
 ```

 #### Get Predictions
 After training, we can use the output model to get predictions for the test data:
 ```
 ../../xgboost mushroom.conf task=pred model_in=0003.model
 ```
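
A minimal Python-package sketch of the same prediction step (assuming a model file saved by the CLI run above):

```python
import xgboost as xgb

# Load the model saved by the CLI and score the test set.
bst = xgb.Booster(model_file='0003.model')
dtest = xgb.DMatrix('agaricus.txt.test')
preds = bst.predict(dtest)  # probabilities in [0,1] for binary:logistic
```
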
@@ -104,7 +104,7 @@ For binary classification, the output predictions are probability confidence sco
 #### Dump Model
 This is a preliminary feature: so far only the tree model supports text dump. XGBoost can dump the tree models into text files so we can inspect the model easily:
 ```
 ../../xgboost mushroom.conf task=dump model_in=0003.model name_dump=dump.raw.txt
 ../../xgboost mushroom.conf task=dump model_in=0003.model fmap=featmap.txt name_dump=dump.nice.txt
 ```
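
The Python package exposes a similar text dump; a brief sketch under the same assumptions as above:

```python
import xgboost as xgb

bst = xgb.Booster(model_file='0003.model')
# fmap supplies human-readable feature names, like the CLI's fmap=featmap.txt.
bst.dump_model('dump.nice.txt', fmap='featmap.txt')
```
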
@@ -137,8 +137,8 @@ Then you can find the following content in log.txt
 ```
 We can also monitor both training and test statistics, by adding the following lines to the configuration:
 ```conf
 eval[test] = "agaricus.txt.test"
 eval[trainname] = "agaricus.txt.train"
 ```
 Run the command again, and we can find that the log file becomes:
 ```
@@ -162,15 +162,9 @@ If you want to continue boosting from existing model, say 0002.model, use
 ```
 xgboost will load from 0002.model and continue boosting for 2 rounds, saving the output to continue.model. Note, however, that the training and evaluation data specified in mushroom.conf should not change when you use this function.
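
A minimal Python-package sketch of continued training (the xgb_model argument of xgb.train resumes from a saved model; file names are the ones from this walkthrough):

```python
import xgboost as xgb

dtrain = xgb.DMatrix('agaricus.txt.train')
params = {'booster': 'gbtree', 'objective': 'binary:logistic', 'max_depth': 3}
# Resume from 0002.model and boost 2 more rounds.
bst = xgb.train(params, dtrain, num_boost_round=2, xgb_model='0002.model')
bst.save_model('continue.model')
```
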
 #### Use Multi-Threading
 When you are working with a large dataset, you may want to take advantage of parallelism. If your compiler supports OpenMP, xgboost is naturally multi-threaded; to set the number of parallel threads, add the ```nthread``` parameter to your configuration.
 E.g. ```nthread=10```
 Set nthread to the number of physical CPU cores (on Unix, this can be found using ```lscpu```).
 Some systems report ```Thread(s) per core = 2```, for example a 4-core CPU with 8 hardware threads; in such a case set ```nthread=4```, not 8.
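
A small sketch of the physical-core heuristic described above (assuming, as in the lscpu example, 2 hardware threads per core; adjust for your machine):

```python
import os

# os.cpu_count() reports logical CPUs; halve it when Thread(s) per core = 2.
logical_cpus = os.cpu_count() or 1
params = {'nthread': max(1, logical_cpus // 2)}
```
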
-#### Additional Notes
-* What are ```agaricus.txt.test.buffer``` and ```agaricus.txt.train.buffer``` generated during runexp.sh?
-  - By default xgboost will automatically generate a binary-format buffer of the input data, with suffix ```buffer```. The next time you run xgboost, it will detect these binary files.

View File: doc/index.md

@@ -7,13 +7,22 @@ for large scale tree boosting.
 This document is hosted at http://xgboost.readthedocs.org/. You can also browse most of the documents on GitHub directly.

-User Guide
-----------
-* [Installation Guide](build.md)
-* [Introduction to Boosted Trees](model.md)
-* [XGBoost Command Line Usage Walkthrough](../demo/binary_classification/README.md)
+Package Documents
+-----------------
+This section contains language-specific package guides.
+
 * [Python Package Document](python/index.md)
 * [R Package Document](R-package/index.md)
 * [XGBoost.jl Julia Package](https://github.com/dmlc/XGBoost.jl)
+
+User Guides
+-----------
+This section contains user guides that are general across languages.
+
+* [Installation Guide](build.md)
+* [Introduction to Boosted Trees](model.md)
 * [Distributed Training](../demo/distributed-training)
 * [Frequently Asked Questions](faq.md)
 * [External Memory Version](external_memory.md)
@@ -22,28 +31,24 @@ User Guide
 * [Text input format](input_format.md)
 * [Notes on Parameter Tuning](param_tuning.md)

-Developer Guide
----------------
-* [Contributor Guide](dev-guide/contribute.md)
-
 Tutorials
 ---------
-Tutorials are self-contained materials that teach you how to achieve a complete data science task with xgboost; these
-are great resources to learn xgboost by real examples. If you think you have something that belongs here, send a pull request.
-* [Binary classification using XGBoost Command Line](../demo/binary_classification/) (CLI)
-  - This tutorial introduces the basic usage of the CLI version of xgboost
-* [Introduction of XGBoost in Python](python/python_intro.md) (python)
-  - This tutorial introduces the python package of xgboost
+This section contains official tutorials of the XGBoost package.
+See [Awesome XGBoost](https://github.com/dmlc/xgboost/tree/master/demo) for links to more resources.
 * [Introduction to XGBoost in R](R-package/xgboostPresentation.md) (R package)
   - This is a general presentation about xgboost in R.
 * [Discover your data with XGBoost in R](R-package/discoverYourData.md) (R package)
   - This tutorial explains feature analysis in xgboost.
+* [Introduction of XGBoost in Python](python/python_intro.md) (python)
+  - This tutorial introduces the python package of xgboost
 * [Understanding XGBoost Model on Otto Dataset](../demo/kaggle-otto/understandingXGBoostModel.Rmd) (R package)
   - This tutorial teaches you how to use xgboost to compete in the Kaggle Otto challenge.

-Resources
----------
-See [awesome xgboost page](https://github.com/dmlc/xgboost/tree/master/demo) for links to other resources.
+Developer Guide
+---------------
+* [Contributor Guide](dev-guide/contribute.md)

 Indices and tables

View File: doc/input_format.md

@@ -12,9 +12,14 @@ train.txt
 1 0:0.01 1:0.3
 0 0:0.2 1:0.3
 ```
 Each line represents a single instance. In the first line, '1' is the instance label, '0' and '1' are feature indices, and '0.01' and '0.3' are the corresponding feature values. In the binary classification case, '1' indicates positive samples and '0' indicates negative samples. We also support probability values in [0,1] as labels, to indicate the probability of the instance being positive.

-## Group Input Format
+Additional Information
+----------------------
+Note: this additional information is only applicable to the single machine version of the package.
+
+### Group Input Format
 As XGBoost supports the [ranking task](../demo/rank), we support the group input format. In a ranking task, instances are categorized into different groups; for example, in the learning-to-rank web pages scenario, web page instances are grouped by their queries. Besides the instance file mentioned above, XGBoost needs a file indicating the group information. For example, if the instance file is the "train.txt" shown above,
 and the group file is as below:
@@ -26,7 +31,7 @@ train.txt.group
 This means that the data set contains 5 instances: the first two instances are in one group and the other three are in another group. The numbers in the group file indicate the number of instances in each group in the instance file, in order.
 During configuration, you do not have to specify the path of the group file. If the instance file name is "xxx", XGBoost will check whether a file named "xxx.group" exists in the same directory, and decide whether to read the data in group input format.
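
A minimal sketch of the same idea in the xgboost Python package (group sizes mirror the train.txt.group example; in Python they are set explicitly rather than discovered from a .group file):

```python
import xgboost as xgb

dtrain = xgb.DMatrix('train.txt')
# 2 instances in the first group, 3 in the second, matching train.txt.group.
dtrain.set_group([2, 3])
```
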
-## Instance Weight File
+### Instance Weight File
 XGBoost supports assigning each instance a weight to differentiate the importance of instances. For example, if we provide an instance weight file for the "train.txt" file in the example as below:

 train.txt.weight
@@ -40,7 +45,7 @@ train.txt.weight
 This means that XGBoost will put more emphasis on the first and fourth instances (that is, the positive instances) while training.
 The configuration is similar to configuring the group information. If the instance file name is "xxx", XGBoost will check whether a file named "xxx.weight" exists in the same directory and, if so, will use the weights while training models. Weights will be included in an "xxx.buffer" file that is created by XGBoost automatically. If you want to update the weights, you need to delete the "xxx.buffer" file prior to launching XGBoost.
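
A minimal Python-package sketch (the weight values are hypothetical, chosen to match the description above; in Python the weights are passed directly, so no .buffer file is involved):

```python
import xgboost as xgb

# Hypothetical weights: emphasize the first and fourth (positive) instances.
weights = [1.0, 0.5, 0.5, 1.0, 0.5]
dtrain = xgb.DMatrix('train.txt', weight=weights)
```
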
-## Initial Margin file
+### Initial Margin file
 XGBoost supports providing each instance an initial margin prediction. For example, if we have an initial prediction from logistic regression for the "train.txt" file, we can create the following file:

 train.txt.base_margin
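
A minimal Python-package sketch of the same mechanism (the margin values are hypothetical; base margins are raw scores before the logistic transformation):

```python
import xgboost as xgb

dtrain = xgb.DMatrix('train.txt')
# Hypothetical raw (pre-sigmoid) scores from a previous logistic model.
dtrain.set_base_margin([-0.4, 1.0, 0.2, -0.1, 0.3])
```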