Group CLI demo into subdirectory. (#6258)

The CLI is not the most developed interface. Putting its demos into a dedicated directory helps new users avoid it, as most use cases go through a language binding.
Jiaming Yuan
2020-10-29 05:40:44 +08:00
committed by GitHub
parent 6383757dca
commit dfac5f89e9
32 changed files with 146 additions and 100 deletions


@@ -0,0 +1,164 @@
Binary Classification
=====================
This is the quick start tutorial for the XGBoost command line interface (CLI).
Here we demonstrate how to use XGBoost for a binary classification task. Before getting started, make sure you compile xgboost in the root directory of the project by typing ```make```.
The script 'runexp.sh' can be used to run the whole demo. Here we use the [mushroom dataset](https://archive.ics.uci.edu/ml/datasets/Mushroom) from the UCI machine learning repository.
### Tutorial
#### Generate Input Data
XGBoost takes LibSVM format as input. An example of synthetic input data is shown below:
```
1 101:1.2 102:0.03
0 1:2.1 10001:300 10002:400
...
```
Each line represents a single instance. In the first line, '1' is the instance label, '101' and '102' are feature indices, and '1.2' and '0.03' are the corresponding feature values. In the binary classification case, '1' indicates positive samples and '0' indicates negative samples. We also support probability values in [0,1] as labels, indicating the probability of the instance being positive.
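To make the format concrete, here is a minimal Python sketch (an illustration, not part of the demo) that parses one LibSVM line into a label and a sparse index-to-value mapping:
```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {idx: val})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for token in parts[1:]:
        idx, val = token.split(':')
        features[int(idx)] = float(val)
    return label, features

# '1 101:1.2 102:0.03' -> (1.0, {101: 1.2, 102: 0.03})
print(parse_libsvm_line('1 101:1.2 102:0.03'))
```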
First we will transform the dataset into classic LibSVM format and split the data into a training set and a test set by running:
```
python mapfeat.py
python mknfold.py agaricus.txt 1
```
The two files, 'agaricus.txt.train' and 'agaricus.txt.test', will be used as the training set and the test set.
#### Training
Then we can run the training process:
```
../../xgboost mushroom.conf
```
mushroom.conf is the configuration file for both training and testing. Each line contains an [attribute]=[value] pair:
```conf
# General Parameters, see comment for each definition
# can be gbtree or gblinear
booster = gbtree
# choose logistic regression loss function for binary classification
objective = binary:logistic
# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight(hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3
# Task Parameters
# the number of round to do boosting
num_round = 2
# 0 means do not save any model except the final round model
save_period = 0
# The path of training data
data = "agaricus.txt.train"
# The path of validation data, used to monitor training process, here [test] sets name of the validation set
eval[test] = "agaricus.txt.test"
# The path of test data
test:data = "agaricus.txt.test"
```
We use the tree booster and the logistic regression objective in our setting. This means that we accomplish our task using classic gradient boosted regression trees (GBRT), a well-established method for binary classification.
The parameters shown in the example are the most common ones needed to use XGBoost.
If you are interested in more parameter settings, the complete parameter list with detailed descriptions can be found [here](../../doc/parameter.rst). Besides putting the parameters in the configuration file, we can set them by passing them as command line arguments:
```
../../xgboost mushroom.conf max_depth=6
```
This sets the parameter max_depth to 6 instead of the 3 specified in the conf file. When you use the command line, make sure max_depth=6 is passed as a single argument, i.e. it must not contain any spaces. When a parameter is set both on the command line and in the config file, the command line setting overrides the one in the config file.
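To illustrate the precedence rule, here is a minimal Python sketch (a hypothetical re-implementation for illustration only; the CLI handles this internally) that parses a conf file into key/value pairs and then applies command line overrides on top:
```python
def parse_conf(path):
    """Parse 'key = value' lines, skipping blanks and '#' comments."""
    params = {}
    for raw in open(path):
        line = raw.split('#', 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition('=')
        params[key.strip()] = value.strip().strip('"')
    return params

def resolve(conf_path, cli_args):
    params = parse_conf(conf_path)
    for arg in cli_args:          # e.g. ['max_depth=6']
        key, _, value = arg.partition('=')
        params[key] = value       # command line wins over the conf file
    return params

# resolve('mushroom.conf', ['max_depth=6'])['max_depth'] -> '6'
```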
In this example, we use the tree booster for gradient boosting. If you would like to use the linear booster instead, you can keep all the parameters except booster and the tree booster parameters, as below:
```conf
# General Parameters
# choose the linear booster
booster = gblinear
...
# Change Tree Booster Parameters into Linear Booster Parameters
# L2 regularization term on weights, default 0
lambda = 0.01
# L1 regularization term on weights, default 0
alpha = 0.01
# L2 regularization term on bias, default 0
lambda_bias = 0.01
# Regression Parameters
...
```
#### Get Predictions
After training, we can use the output model to get predictions for the test data:
```
../../xgboost mushroom.conf task=pred model_in=0002.model
```
For binary classification, the output predictions are probability confidence scores in [0,1], corresponding to the probability of the label being positive.
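The scores are written one per line to a text file (pred.txt by default; adjust the path if your setup differs). Should you need hard 0/1 labels, a minimal sketch (a hypothetical helper, not part of the demo) that thresholds the scores at 0.5:
```python
def load_hard_labels(path='pred.txt', threshold=0.5):
    """Read one probability per line and threshold into 0/1 labels."""
    return [1 if float(line) >= threshold else 0
            for line in open(path) if line.strip()]
```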
#### Dump Model
This is a preliminary feature, so only tree models currently support text dump. XGBoost can dump tree models into text or JSON files, which lets us inspect the model easily:
```
../../xgboost mushroom.conf task=dump model_in=0002.model name_dump=dump.raw.txt
../../xgboost mushroom.conf task=dump model_in=0002.model fmap=featmap.txt name_dump=dump.nice.txt
```
In this demo, the boosted trees will be dumped to dump.raw.txt and dump.nice.txt; the latter is easier to read because it uses the feature map featmap.txt.
The format of ```featmap.txt``` is ```<featureid> <featurename> <q or i or int>\n```, with one feature per line (a small validation sketch follows this list):
- Feature ids must run from 0 to the number of features minus one, in sorted order.
- 'i' means the feature is a binary indicator feature.
- 'q' means the feature is a quantitative value, such as age or time, and can be missing.
- 'int' means the feature takes integer values (when 'int' is hinted, the decision boundary will be an integer).
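As a quick sanity check of this format, here is a minimal sketch (a hypothetical helper, not part of the demo) that validates a feature map file against the rules above:
```python
def check_featmap(path):
    """Validate '<featureid> <featurename> <type>' lines."""
    valid_types = {'q', 'i', 'int'}
    count = 0
    for line in open(path):
        fid, name, ftype = line.split()
        assert int(fid) == count, 'ids must run 0..n-1 in sorted order'
        assert ftype in valid_types, 'type must be q, i or int'
        count += 1
    return count

print(check_featmap('featmap.txt'), 'features look valid')
```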
#### Monitoring Progress
When you run training, you will see messages like these displayed on screen:
```
tree train end, 1 roots, 12 extra nodes, 0 pruned nodes ,max_depth=3
[0] test-error:0.016139
boosting round 1, 0 sec elapsed
tree train end, 1 roots, 10 extra nodes, 0 pruned nodes ,max_depth=3
[1] test-error:0.000000
```
The evaluation messages are printed to stderr, so if you only want to log the evaluation progress, simply type:
```
../../xgboost mushroom.conf 2>log.txt
```
Then you can find the following content in log.txt:
```
[0] test-error:0.016139
[1] test-error:0.000000
```
We can also monitor both training and test statistics by adding the following lines to the configuration file:
```conf
eval[test] = "agaricus.txt.test"
eval[trainname] = "agaricus.txt.train"
```
Running the command again, we find that the log file becomes:
```
[0] test-error:0.016139 trainname-error:0.014433
[1] test-error:0.000000 trainname-error:0.001228
```
The rule is eval[name-printed-in-log] = filename; the file will then be added to the monitoring process and evaluated each round.
XGBoost also supports monitoring multiple metrics. Suppose we also want to monitor the average log-likelihood of each prediction during training; simply add ```eval_metric=logloss``` to the configuration file. Running again, we find that the log file becomes:
```
[0] test-error:0.016139 test-negllik:0.029795 trainname-error:0.014433 trainname-negllik:0.027023
[1] test-error:0.000000 test-negllik:0.000000 trainname-error:0.001228 trainname-negllik:0.002457
```
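Every evaluation line follows the pattern ```[round] name-metric:value ...```, which makes the log easy to post-process. A minimal sketch (a hypothetical helper, not part of the demo) that collects each metric into a per-round list:
```python
import re
from collections import defaultdict

def parse_eval_log(path):
    """Turn '[0] test-error:0.016139 ...' lines into metric curves."""
    curves = defaultdict(list)
    for line in open(path):
        m = re.match(r'\[(\d+)\]\s+(.*)', line.strip())
        if m is None:
            continue                      # skip non-evaluation lines
        for pair in m.group(2).split():
            name, value = pair.rsplit(':', 1)
            curves[name].append(float(value))
    return dict(curves)

# parse_eval_log('log.txt')['test-error'] -> [0.016139, 0.0]
```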
#### Saving Progress Models
If you want to save the model every two rounds, simply set save_period=2. You will find 0002.model in the current folder. If you want to change the output folder for the models, add model_dir=foldername. By default XGBoost saves the model of the last round.
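Assuming the zero-padded naming scheme seen above (e.g. 0002.model), here is a small sketch (a hypothetical helper) of which checkpoint files to expect for a given run:
```python
def expected_checkpoints(num_round, save_period):
    """Checkpoint names under an assumed '%04d.model' scheme."""
    if save_period == 0:
        return []                 # only the final model is written
    return ['%04d.model' % r
            for r in range(save_period, num_round + 1, save_period)]

# num_round=2, save_period=2 -> ['0002.model']
print(expected_checkpoints(2, 2))
```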
#### Continue from Existing Model
If you want to continue boosting from an existing model, say 0002.model, use:
```
../../xgboost mushroom.conf model_in=0002.model num_round=2 model_out=continue.model
```
XGBoost will load 0002.model and continue boosting for another 2 rounds, then save the output to continue.model. Note, however, that the training and evaluation data specified in mushroom.conf should not change when you use this function.
#### Use Multi-Threading
When you are working with a large dataset, you may want to take advantage of parallelism. If your compiler supports OpenMP, XGBoost is naturally multi-threaded; to set the number of parallel threads, add the ```nthread``` parameter to your configuration, e.g. ```nthread=10```.
Set nthread to the number of physical CPU cores (on Unix this can be found using ```lscpu```).
Some systems report ```Thread(s) per core = 2```; for example, a 4-core CPU may expose 8 hardware threads, in which case set ```nthread=4```, not 8. A sketch for deriving this number is given below.
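On Linux, one way to derive the physical core count programmatically is to parse ```lscpu``` output, as in this sketch (Linux-only; an illustration, not part of the demo):
```python
import subprocess

def physical_cores():
    """Estimate physical cores from `lscpu` (Linux only)."""
    info = {}
    for line in subprocess.check_output(['lscpu'], text=True).splitlines():
        key, _, value = line.partition(':')
        info[key.strip()] = value.strip()
    logical = int(info['CPU(s)'])
    per_core = int(info.get('Thread(s) per core', '1'))
    return logical // per_core

print('suggested nthread =', physical_cores())
```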

File diff suppressed because it is too large


@@ -0,0 +1,32 @@
1. cap-shape: bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
4. bruises?: bruises=t,no=f
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,
musty=m,none=n,pungent=p,spicy=s
6. gill-attachment: attached=a,descending=d,free=f,notched=n
7. gill-spacing: close=c,crowded=w,distant=d
8. gill-size: broad=b,narrow=n
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,
green=r,orange=o,pink=p,purple=u,red=e,
white=w,yellow=y
10. stalk-shape: enlarging=e,tapering=t
11. stalk-root: bulbous=b,club=c,cup=u,equal=e,
rhizomorphs=z,rooted=r,missing=?
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,
pink=p,red=e,white=w,yellow=y
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,
pink=p,red=e,white=w,yellow=y
16. veil-type: partial=p,universal=u
17. veil-color: brown=n,orange=o,white=w,yellow=y
18. ring-number: none=n,one=o,two=t
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,
none=n,pendant=p,sheathing=s,zone=z
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,
orange=o,purple=u,white=w,yellow=y
21. population: abundant=a,clustered=c,numerous=n,
scattered=s,several=v,solitary=y
22. habitat: grasses=g,leaves=l,meadows=m,paths=p,
urban=u,waste=w,woods=d


@@ -0,0 +1,148 @@
1. Title: Mushroom Database
2. Sources:
(a) Mushroom records drawn from The Audubon Society Field Guide to North
American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred
A. Knopf
(b) Donor: Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)
(c) Date: 27 April 1987
3. Past Usage:
1. Schlimmer,J.S. (1987). Concept Acquisition Through Representational
Adjustment (Technical Report 87-19). Doctoral dissertation, Department
of Information and Computer Science, University of California, Irvine.
--- STAGGER: asymptoted to 95% classification accuracy after reviewing
1000 instances.
2. Iba,W., Wogulis,J., & Langley,P. (1988). Trading off Simplicity
and Coverage in Incremental Concept Learning. In Proceedings of
the 5th International Conference on Machine Learning, 73-79.
Ann Arbor, Michigan: Morgan Kaufmann.
-- approximately the same results with their HILLARY algorithm
3. In the following references a set of rules (given below) were
learned for this data set which may serve as a point of
comparison for other researchers.
Duch W, Adamczak R, Grabczewski K (1996) Extraction of logical rules
from training data using backpropagation networks, in: Proc. of the
The 1st Online Workshop on Soft Computing, 19-30.Aug.1996, pp. 25-30,
available on-line at: http://www.bioele.nuee.nagoya-u.ac.jp/wsc1/
Duch W, Adamczak R, Grabczewski K, Ishikawa M, Ueda H, Extraction of
crisp logical rules using constrained backpropagation networks -
comparison of two new approaches, in: Proc. of the European Symposium
on Artificial Neural Networks (ESANN'97), Bruge, Belgium 16-18.4.1997,
pp. xx-xx
Wlodzislaw Duch, Department of Computer Methods, Nicholas Copernicus
University, 87-100 Torun, Grudziadzka 5, Poland
e-mail: duch@phys.uni.torun.pl
WWW http://www.phys.uni.torun.pl/kmk/
Date: Mon, 17 Feb 1997 13:47:40 +0100
From: Wlodzislaw Duch <duch@phys.uni.torun.pl>
Organization: Dept. of Computer Methods, UMK
I have attached a file containing logical rules for mushrooms.
It should be helpful for other people since only in the last year I
have seen about 10 papers analyzing this dataset and obtaining quite
complex rules. We will try to contribute other results later.
With best regards, Wlodek Duch
________________________________________________________________
Logical rules for the mushroom data sets.
Logical rules given below seem to be the simplest possible for the
mushroom dataset and therefore should be treated as benchmark results.
Disjunctive rules for poisonous mushrooms, from most general
to most specific:
P_1) odor=NOT(almond.OR.anise.OR.none)
120 poisonous cases missed, 98.52% accuracy
P_2) spore-print-color=green
48 cases missed, 99.41% accuracy
P_3) odor=none.AND.stalk-surface-below-ring=scaly.AND.
(stalk-color-above-ring=NOT.brown)
8 cases missed, 99.90% accuracy
P_4) habitat=leaves.AND.cap-color=white
100% accuracy
Rule P_4) may also be
P_4') population=clustered.AND.cap_color=white
These rule involve 6 attributes (out of 22). Rules for edible
mushrooms are obtained as negation of the rules given above, for
example the rule:
odor=(almond.OR.anise.OR.none).AND.spore-print-color=NOT.green
gives 48 errors, or 99.41% accuracy on the whole dataset.
Several slightly more complex variations on these rules exist,
involving other attributes, such as gill_size, gill_spacing,
stalk_surface_above_ring, but the rules given above are the simplest
we have found.
4. Relevant Information:
This data set includes descriptions of hypothetical samples
corresponding to 23 species of gilled mushrooms in the Agaricus and
Lepiota Family (pp. 500-525). Each species is identified as
definitely edible, definitely poisonous, or of unknown edibility and
not recommended. This latter class was combined with the poisonous
one. The Guide clearly states that there is no simple rule for
determining the edibility of a mushroom; no rule like ``leaflets
three, let it be'' for Poisonous Oak and Ivy.
5. Number of Instances: 8124
6. Number of Attributes: 22 (all nominally valued)
7. Attribute Information: (classes: edible=e, poisonous=p)
1. cap-shape: bell=b,conical=c,convex=x,flat=f,
knobbed=k,sunken=s
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,
pink=p,purple=u,red=e,white=w,yellow=y
4. bruises?: bruises=t,no=f
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,
musty=m,none=n,pungent=p,spicy=s
6. gill-attachment: attached=a,descending=d,free=f,notched=n
7. gill-spacing: close=c,crowded=w,distant=d
8. gill-size: broad=b,narrow=n
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,
green=r,orange=o,pink=p,purple=u,red=e,
white=w,yellow=y
10. stalk-shape: enlarging=e,tapering=t
11. stalk-root: bulbous=b,club=c,cup=u,equal=e,
rhizomorphs=z,rooted=r,missing=?
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,
pink=p,red=e,white=w,yellow=y
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,
pink=p,red=e,white=w,yellow=y
16. veil-type: partial=p,universal=u
17. veil-color: brown=n,orange=o,white=w,yellow=y
18. ring-number: none=n,one=o,two=t
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,
none=n,pendant=p,sheathing=s,zone=z
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,
orange=o,purple=u,white=w,yellow=y
21. population: abundant=a,clustered=c,numerous=n,
scattered=s,several=v,solitary=y
22. habitat: grasses=g,leaves=l,meadows=m,paths=p,
urban=u,waste=w,woods=d
8. Missing Attribute Values: 2480 of them (denoted by "?"), all for
attribute #11.
9. Class Distribution:
-- edible: 4208 (51.8%)
-- poisonous: 3916 (48.2%)
-- total: 8124 instances


@@ -0,0 +1,47 @@
#!/usr/bin/python
# Build a feature map from the UCI attribute description and convert the
# categorical mushroom data into one-hot encoded LibSVM format.


def loadfmap(fname):
    """Parse the attribute list into fmap (attribute index -> {code: feature id})
    and nmap (feature id -> readable name)."""
    fmap = {}
    nmap = {}
    for l in open(fname):
        arr = l.split()
        if arr[0].find('.') != -1:
            # New attribute line, e.g. "1. cap-shape: bell=b,conical=c,..."
            idx = int(arr[0].strip('.'))
            assert idx not in fmap
            fmap[idx] = {}
            ftype = arr[1].strip(':')
            content = arr[2]
        else:
            # Continuation line with more value codes for the same attribute.
            content = arr[0]
        for it in content.split(','):
            if it.strip() == '':
                continue
            k, v = it.split('=')
            fmap[idx][v] = len(nmap)
            nmap[len(nmap)] = ftype + '=' + k
    return fmap, nmap


def write_nmap(fo, nmap):
    # Every one-hot feature is a binary indicator, hence type 'i'.
    for i in range(len(nmap)):
        fo.write('%d\t%s\ti\n' % (i, nmap[i]))


# start here
fmap, nmap = loadfmap('agaricus-lepiota.fmap')
fo = open('featmap.txt', 'w')
write_nmap(fo, nmap)
fo.close()

fo = open('agaricus.txt', 'w')
for l in open('agaricus-lepiota.data'):
    arr = l.split(',')
    # First column is the class label: p (poisonous) -> 1, e (edible) -> 0.
    if arr[0] == 'p':
        fo.write('1')
    else:
        assert arr[0] == 'e'
        fo.write('0')
    # Remaining columns are categorical codes mapped to one-hot feature ids.
    for i in range(1, len(arr)):
        fo.write(' %d:1' % fmap[i][arr[i].strip()])
    fo.write('\n')
fo.close()


@@ -0,0 +1,29 @@
#!/usr/bin/python
# Randomly split a dataset into train/test: each line is assigned to one of
# nfold folds; fold k becomes the test set and the rest the training set.
import sys
import random

if len(sys.argv) < 3:
    print('Usage: <filename> <k> [nfold = 5]')
    exit(0)

random.seed(10)

k = int(sys.argv[2])
if len(sys.argv) > 3:
    nfold = int(sys.argv[3])
else:
    nfold = 5

fi = open(sys.argv[1], 'r')
ftr = open(sys.argv[1] + '.train', 'w')
fte = open(sys.argv[1] + '.test', 'w')
for l in fi:
    if random.randint(1, nfold) == k:
        fte.write(l)
    else:
        ftr.write(l)
fi.close()
ftr.close()
fte.close()


@@ -0,0 +1,29 @@
# General Parameters, see comment for each definition
# choose the booster, can be gbtree or gblinear
booster = gbtree
# choose logistic regression loss function for binary classification
objective = binary:logistic
# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight(hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3
# Task Parameters
# the number of round to do boosting
num_round = 2
# save the model every save_period rounds; 0 means only the final model is saved
save_period = 2
# The path of training data
data = "agaricus.txt.train"
# The path of validation data, used to monitor training process, here [test] sets name of the validation set
eval[test] = "agaricus.txt.test"
# evaluate on training data as well each round
eval_train = 1
# The path of test data
test:data = "agaricus.txt.test"


@@ -0,0 +1,17 @@
#!/bin/bash
# map feature using indicator encoding, also produce featmap.txt
python mapfeat.py
# split train and test
python mknfold.py agaricus.txt 1
XGBOOST=../../../xgboost
# training and output the models
$XGBOOST mushroom.conf
# output prediction task=pred
$XGBOOST mushroom.conf task=pred model_in=0002.model
# print the boosters of 0002.model in dump.raw.txt
$XGBOOST mushroom.conf task=dump model_in=0002.model name_dump=dump.raw.txt
# use the feature map in printing for better visualization
$XGBOOST mushroom.conf task=dump model_in=0002.model fmap=featmap.txt name_dump=dump.nice.txt
cat dump.nice.txt