Binary Classification
=====================
This is the quick start tutorial for the xgboost CLI version. You can also check out [../../doc/README.md](../../doc/README.md) for links to the Python and R tutorials.
Here we demonstrate how to use XGBoost for a binary classification task. Before getting started, make sure you have compiled xgboost by running ```make``` in the root directory of the project.

The script runexp.sh can be used to run the demo. Here we use the [mushroom dataset](https://archive.ics.uci.edu/ml/datasets/Mushroom) from the UCI machine learning repository.
### Tutorial

#### Generate Input Data
XGBoost takes LibSVM format. An example of faked input data is below:

```
1 101:1.2 102:0.03
0 1:2.1 10001:300 10002:400
...
```

Each line represents a single instance. In the first line, '1' is the instance label, '101' and '102' are feature indices, and '1.2' and '0.03' are the corresponding feature values. In the binary classification case, '1' indicates positive samples and '0' indicates negative samples. We also support probability values in [0,1] as labels, indicating the probability of the instance being positive.
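To make the format concrete, here is a minimal parsing sketch. It is not part of the demo, and the helper name is ours; it simply turns one LibSVM line into a label and a sparse feature map:

```python
def parse_libsvm_line(line):
    """Parse one LibSVM line: '<label> <index>:<value> ...'."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features

label, features = parse_libsvm_line("1 101:1.2 102:0.03")
# label is 1.0; features maps 101 -> 1.2 and 102 -> 0.03
```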
The demo then splits the data into a training set and a test set:

```
python mknfold.py agaricus.txt 1
```

The two files 'agaricus.txt.train' and 'agaricus.txt.test' will be used as the training set and test set.

#### Training
Then we can run the training process:

```
../../xgboost mushroom.conf
```

mushroom.conf is the configuration file for both training and testing. Each line contains an [attribute]=[value] pair:
```conf
# General Parameters, see comment for each definition
# can be gbtree or gblinear
booster = gbtree
# choose logistic regression loss function for binary classification
objective = binary:logistic

# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight (hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3

# Task Parameters
# the number of rounds for boosting
num_round = 2
# 0 means do not save any model except the final round model
save_period = 0
# The path of training data
data = "agaricus.txt.train"
# The path of validation data, used to monitor training process; here [test] sets the name of the validation set
eval[test] = "agaricus.txt.test"
# The path of test data
test:data = "agaricus.txt.test"
```
We use the tree booster and logistic regression objective in our setting. This means that we accomplish our task using the classic gradient boosted regression tree (GBRT), a well-established method for binary classification.
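The conf format above is plain ```key = value``` lines with ```#``` comments. As an illustration only (this parser is our sketch, not xgboost's actual loader), such a file can be read like this:

```python
def parse_conf(text):
    """Parse 'key = value' lines; '#' starts a comment; quotes around values are stripped."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip().strip('"')
    return params

conf = parse_conf("""
# can be gbtree or gblinear
booster = gbtree
num_round = 2
data = "agaricus.txt.train"
""")
# conf == {'booster': 'gbtree', 'num_round': '2', 'data': 'agaricus.txt.train'}
```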
If you are interested in more parameter settings, see the complete parameter documentation.

XGBoost also accepts parameters on the command line, e.g. ```../../xgboost mushroom.conf max_depth=6```. This means that the parameter max_depth will be set to 6 rather than the 3 given in the conf file. When you use the command line, make sure max_depth=6 is passed as a single argument, i.e. do not put spaces inside the argument. When a parameter is set both on the command line and in the config file, the command line setting overrides the one in the config file.
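The override behavior can be sketched as a simple merge where command-line ```key=value``` arguments take precedence (illustrative only; this is not xgboost's actual code):

```python
def merge_params(conf_params, cli_args):
    """Apply command-line 'key=value' arguments on top of conf-file settings."""
    params = dict(conf_params)
    for arg in cli_args:
        key, _, value = arg.partition("=")
        params[key.strip()] = value.strip()
    return params

params = merge_params({"max_depth": "3", "eta": "1.0"}, ["max_depth=6"])
# params["max_depth"] is now "6"; "eta" keeps its conf-file value "1.0"
```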
In this example, we use the tree booster for gradient boosting. If you would like to use the linear booster instead, you can keep all the parameters except ```booster``` and the tree booster parameters, as below:

```conf
# General Parameters
# choose the linear booster
booster = gblinear
...

# Linear Booster Parameters
# L1 regularization term on weights, default 0
alpha = 0.01
# L2 regularization term on bias, default 0
lambda_bias = 0.01

# Regression Parameters
...
```

Notes on the input format:
- xgboost allows feature indices to start from 0
- for binary classification, the label is 1 for positive and 0 for negative, instead of +1/-1
- the feature indices in each line *do not* need to be sorted

#### Get Predictions
After training, we can use the output model to get predictions on the test data:

```
../../xgboost mushroom.conf task=pred model_in=0003.model
```

For binary classification, the output predictions are probability confidence scores in [0,1], indicating the probability of the instance being positive.
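If you need hard 0/1 labels rather than probabilities, a common (but application-dependent) approach is to threshold at 0.5; a minimal sketch:

```python
def to_labels(probabilities, threshold=0.5):
    """Turn probability scores in [0,1] into binary 0/1 predictions."""
    return [1 if p > threshold else 0 for p in probabilities]

print(to_labels([0.92, 0.13, 0.5]))  # [1, 0, 0]
```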
#### Dump Model
This is a preliminary feature: so far only the tree model supports text dump. XGBoost can dump the tree models into a text file so we can inspect the model easily:

```
../../xgboost mushroom.conf task=dump model_in=0003.model name_dump=dump.raw.txt
../../xgboost mushroom.conf task=dump model_in=0003.model fmap=featmap.txt name_dump=dump.nice.txt
```

The second command uses featmap.txt to produce a dump with feature names (dump.nice.txt) instead of raw feature indices.
Then you can find the evaluation results in log.txt.

We can also monitor both training and test statistics by adding the following lines to the configuration:

```conf
eval[test] = "agaricus.txt.test"
eval[trainname] = "agaricus.txt.train"
```

Run the command again, and the log file will now contain evaluation statistics for both the training and test sets.
If you want to continue boosting from an existing model, say 0002.model, use:

```
../../xgboost mushroom.conf model_in=0002.model num_round=2 model_out=continue.model
```

xgboost will load from 0002.model, continue boosting for 2 more rounds, and save the output to continue.model. However, beware that the training and evaluation data specified in mushroom.conf should not change when you use this function.
#### Use Multi-Threading
When you are working with a large dataset, you may want to take advantage of parallelism. If your compiler supports OpenMP, xgboost is naturally multi-threaded; to set the number of parallel threads, add the ```nthread``` parameter to your configuration, e.g. ```nthread=10```.

Set nthread to the number of physical CPU cores (on Unix, this can be found using ```lscpu```). Some systems report ```Thread(s) per core: 2```; for example, on a 4-core CPU with 8 hardware threads, set ```nthread=4```, not 8.
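The rule of thumb above (physical cores, not hardware threads) amounts to dividing the logical CPU count by the ```Thread(s) per core``` value reported by lscpu; a sketch, with a helper name of our own:

```python
def recommended_nthread(logical_cpus, threads_per_core):
    """Physical cores = logical CPUs / threads per core (both readable from lscpu output)."""
    return logical_cpus // threads_per_core

print(recommended_nthread(8, 2))  # 4: a 4-core CPU exposing 8 hardware threads
```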

#### Additional Notes
* What are ```agaricus.txt.test.buffer``` and ```agaricus.txt.train.buffer``` generated during runexp.sh?
  - By default xgboost automatically generates a binary-format buffer of the input data, with the suffix ```buffer```. The next time you run xgboost, it detects these binary files and loads from them instead of re-parsing the text input.
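The buffer behavior can be pictured as a small lookup. This is a sketch of the idea, not xgboost's implementation, and ```resolve_input``` is a hypothetical helper:

```python
import os

def resolve_input(path):
    """Prefer the binary buffer 'path.buffer' when it exists, else the text file."""
    buffered = path + ".buffer"
    return buffered if os.path.exists(buffered) else path

# With no buffer file present, the original text path is returned unchanged.
print(resolve_input("agaricus.txt.train"))
```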