diff --git a/demo/guide-python/README.md b/demo/guide-python/README.md
index bc1c219d0..32d0290ab 100644
--- a/demo/guide-python/README.md
+++ b/demo/guide-python/README.md
@@ -7,3 +7,5 @@ XGBoost Python Feature Walkthrough
 * [Generalized Linear Model](generalized_linear_model.py)
 * [Cross validation](cross_validation.py)
 * [Predicting leaf indices](predict_leaf_indices.py)
+* [Sklearn Wrapper](sklearn_example.py)
+* [External Memory](external_memory.py)
diff --git a/demo/guide-python/external_memory.py b/demo/guide-python/external_memory.py
new file mode 100755
index 000000000..eb579c935
--- /dev/null
+++ b/demo/guide-python/external_memory.py
@@ -0,0 +1,25 @@
+#!/usr/bin/python
+import numpy as np
+import scipy.sparse
+import xgboost as xgb
+
+### Simple example of using the external memory version
+
+# This is the only difference: append a '#' followed by a cache prefix to the filename.
+# Several cache files with that prefix will be generated.
+# Currently only loading from libsvm files is supported.
+dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache')
+dtest = xgb.DMatrix('../data/agaricus.txt.test#dtest.cache')
+
+# Specify parameters and a validation set to watch performance.
+param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic' }
+
+# Performance note: set nthread to the number of your *real* CPU cores.
+# Some CPUs offer two threads per core; for example, on a 4-core CPU with 8 threads, set nthread=4.
+#param['nthread']=num_real_cpu
+
+watchlist = [(dtest,'eval'), (dtrain,'train')]
+num_round = 2
+bst = xgb.train(param, dtrain, num_round, watchlist)
+
+
diff --git a/doc/README.md b/doc/README.md
new file mode 100644
index 000000000..801e7e65a
--- /dev/null
+++ b/doc/README.md
@@ -0,0 +1,19 @@
+XGBoost Documentation
+====
+This is an ongoing effort to move the [wiki documentation](https://github.com/dmlc/xgboost/wiki) here.
+
+List of Documentation
+====
+* [Parameters](parameter.md)
+* [Using XGBoost in Python](python.md)
+* [External Memory Version](external_memory.md)
+
+Highlight Links
+====
+This section collects blog posts, presentations and videos discussing how to use xgboost to solve your problem. If you think something belongs here, send a pull request.
+* Blog post by phunther: [Winning solution of Kaggle Higgs competition: what a single model can do](http://no2147483647.wordpress.com/2014/09/17/winning-solution-of-kaggle-higgs-competition-what-a-single-model-can-do/)
+* [Kaggle Tradeshift winning solution by daxiongshu](https://github.com/daxiongshu/kaggle-tradeshift-winning-solution)
+
+Contribution
+====
+Contributions of documentation and use cases are welcome!
diff --git a/doc/external_memory.md b/doc/external_memory.md
new file mode 100644
index 000000000..e98133467
--- /dev/null
+++ b/doc/external_memory.md
@@ -0,0 +1,32 @@
+Using XGBoost External Memory Version
+====
+There is no big difference between using the external memory version and the in-memory version.
+The only difference is the filename format.
+
+The external memory version takes the following filename format:
+```
+filename#cacheprefix
+```
+
+```filename``` is the normal path to the libsvm file you want to load, and ```cacheprefix``` is a
+path prefix for the cache files that xgboost will create for the external memory cache.
+
+The following code was extracted from [../demo/guide-python/external_memory.py](../demo/guide-python/external_memory.py):
+```python
+dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache')
+```
+Notice the additional ```#dtrain.cache``` following the libsvm file; this is the name of the cache file.
+For the CLI version, simply use ```"../data/agaricus.txt.train#dtrain.cache"``` as the filename.
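+A minimal sketch of what that could look like on the command line, assuming a configuration file (such as the mushroom.conf used by the binary classification demo) that sets the remaining parameters; the executable path and file names are placeholders:
+```
+./xgboost mushroom.conf data="../data/agaricus.txt.train#dtrain.cache" test:data="../data/agaricus.txt.test#dtest.cache"
+```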
+
+Performance Note
+====
+* The parameter ```nthread``` should be set to the number of ***real*** cores.
+  - Most modern CPUs offer hyperthreading, which means a 4-core CPU can expose 8 threads.
+  - Set nthread to 4 for maximum performance in such a case.
+
+Usage Note
+====
+* This is an experimental version.
+  - If you would like to try and test it, please report your results to https://github.com/dmlc/xgboost/issues/244
+* Currently only importing from libsvm format is supported.
+  - Contributions of ingestion from other common external memory data sources are welcome.
diff --git a/doc/parameter.md b/doc/parameter.md
new file mode 100644
index 000000000..2ced29935
--- /dev/null
+++ b/doc/parameter.md
@@ -0,0 +1,111 @@
+XGBoost Parameters
+====
+Before running XGBoost, we must set three types of parameters: general parameters, booster parameters and task parameters.
+- General parameters relate to which booster we are using to do boosting, commonly a tree or linear model.
+- Booster parameters depend on which booster you have chosen.
+- Task parameters decide the learning scenario; for example, regression tasks may use different parameters than ranking tasks.
+- In addition to these parameters, there are console parameters that relate to the behavior of the console version of xgboost (e.g. when to save the model).
+
+### Parameters in R Package
+In the R package, you can use a dot (.) in place of the underscore in parameter names; for example, max.depth can be used for max_depth. The underscore parameters are also valid in R.
+
+### General Parameters
+* booster [default=gbtree]
+  - which booster to use; can be gbtree or gblinear. The details of the different boosters are described [here](https://github.com/dmlc/xgboost/wiki/Boosters).
+* silent [default=0]
+  - 0 means printing running messages, 1 means silent mode.
+* nthread [defaults to the maximum number of threads available if not set]
+  - number of parallel threads used to run xgboost
+* num_pbuffer [set automatically by xgboost, no need to be set by user]
+  - size of the prediction buffer, normally set to the number of training instances. The buffer is used to save the prediction results of the last boosting step.
+* num_feature [set automatically by xgboost, no need to be set by user]
+  - feature dimension used in boosting, set to the maximum dimension of the features
+
+### Booster Parameters
+From xgboost-unity, the ```bst:``` prefix is no longer needed for booster parameters. Parameters with or without the bst: prefix are equivalent (i.e. both bst:eta and eta are valid parameter settings).
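+
+For instance, here is a small hypothetical snippet (not taken from the demos) that sets the same parameters both ways; either dict can be passed to ```xgb.train```:
+```python
+# with xgboost-unity the two spellings configure the same parameters
+param_plain    = {'eta': 0.3, 'max_depth': 6}
+param_prefixed = {'bst:eta': 0.3, 'bst:max_depth': 6}  # equivalent to param_plain
+```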
+
+#### Parameter for Tree Booster
+* eta [default=0.3]
+  - step size shrinkage used in updates to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
+* gamma
+  - minimum loss reduction required to make a further partition on a leaf node of the tree. The larger it is, the more conservative the algorithm will be.
+* max_depth [default=6]
+  - maximum depth of a tree
+* min_child_weight [default=1]
+  - minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, the building process will give up further partitioning. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node. The larger it is, the more conservative the algorithm will be.
+* max_delta_step [default=0]
+  - maximum delta step we allow each tree's weight estimation to be. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when the classes are extremely imbalanced. Setting it to a value of 1-10 might help control the update.
+* subsample [default=1]
+  - subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the data instances to grow trees, which helps prevent overfitting.
+* colsample_bytree [default=1]
+  - subsample ratio of columns when constructing each tree.
+
+#### Parameter for Linear Booster
+* lambda [default=0]
+  - L2 regularization term on weights
+* alpha [default=0]
+  - L1 regularization term on weights
+* lambda_bias
+  - L2 regularization term on bias, default 0 (there is no L1 regularization term on bias because it is not important)
+
+### Task Parameters
+* objective [default=reg:linear]
+  - specify the learning task and the corresponding learning objective; the objective options are listed below:
+  - "reg:linear" --linear regression
+  - "reg:logistic" --logistic regression
+  - "binary:logistic" --logistic regression for binary classification, outputs probability
+  - "binary:logitraw" --logistic regression for binary classification, outputs the score before the logistic transformation
+  - "multi:softmax" --set XGBoost to do multiclass classification using the softmax objective; you also need to set num_class (the number of classes)
+  - "multi:softprob" --same as softmax, but outputs a vector of ndata * nclass, which can be further reshaped into an ndata, nclass matrix. The result contains the predicted probability of each data point belonging to each class.
+  - "rank:pairwise" --set XGBoost to do a ranking task by minimizing the pairwise loss
+* base_score [default=0.5]
+  - the initial prediction score of all instances, global bias
+* eval_metric [default according to objective]
+  - evaluation metrics for validation data; a default metric will be assigned according to the objective (rmse for regression, error for classification, mean average precision for ranking)
+  - Users can add multiple evaluation metrics. Python users should remember to pass the metrics in as a list of parameter pairs instead of a map, so that a later 'eval_metric' won't override an earlier one (see the sketch after this list).
+  - The choices are listed below:
+    - "rmse": [root mean square error](http://en.wikipedia.org/wiki/Root_mean_square_error)
+    - "logloss": negative [log-likelihood](http://en.wikipedia.org/wiki/Log-likelihood)
+    - "error": binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation regards instances with a prediction value larger than 0.5 as positive instances, and the others as negative instances.
+    - "merror": multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases).
+    - "mlogloss": multiclass logloss
+    - "auc": [area under the curve](http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_curve) for ranking evaluation.
+    - "ndcg": [normalized discounted cumulative gain](http://en.wikipedia.org/wiki/NDCG)
+    - "map": [mean average precision](http://en.wikipedia.org/wiki/Mean_average_precision#Mean_average_precision)
+    - "ndcg@n", "map@n": n can be assigned as an integer to cut off the top positions in the lists for evaluation.
+    - "ndcg-", "map-", "ndcg@n-", "map@n-": in XGBoost, NDCG and MAP evaluate the score of a list without any positive samples as 1. By adding "-" to the evaluation metric name, XGBoost will evaluate these scores as 0, to be consistent under some conditions.
+* seed [default=0]
+  - random number seed.
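+
+For example, a minimal sketch of how a Python user might set task parameters, passing two evaluation metrics as a list of pairs so that neither overrides the other (the metric names and values here are only illustrative):
+```python
+param = {'objective': 'binary:logistic', 'base_score': 0.5, 'seed': 0}
+plst = list(param.items())
+plst += [('eval_metric', 'logloss'), ('eval_metric', 'auc')]  # both metrics are kept
+# plst can then be passed to xgb.train in place of a plain dict
+```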
+ - "ndcg":[Normalized Discounted Cumulative Gain](http://en.wikipedia.org/wiki/NDCG) + - "map":[Mean average precision](http://en.wikipedia.org/wiki/Mean_average_precision#Mean_average_precision) + - "ndcg@n","map@n": n can be assigned as an integer to cut off the top positions in the lists for evaluation. + - "ndcg-","map-","ndcg@n-","map@n-": In XGBoost, NDCG and MAP will evaluate the score of a list without any positive samples as 1. By adding "-" in the evaluation metric XGBoost will evaluate these score as 0 to be consistent under some conditions. +training repeatively +* seed [ default=0 ] + - random number seed. + +### Console Parameters +The following parameters are only used in the console version of xgboost +* use_buffer [ default=1 ] + - whether create binary buffer for text input, this normally will speedup loading when do +* num_round + - the number of round for boosting. +* data + - The path of training data +* test:data + - The path of test data to do prediction +* save_period [default=0] + - the period to save the model, setting save_period=10 means that for every 10 rounds XGBoost will save the model, setting it to 0 means not save any model during training. +* task [default=train] options: train, pred, eval, dump + - train: training using data + - pred: making prediction for test:data + - eval: for evaluating statistics specified by eval[name]=filenam + - dump: for dump the learned model into text format(preliminary) +* model_in [default=NULL] + - path to input model, needed for test, eval, dump, if it is specified in training, xgboost will continue training from the input model +* model_out [default=NULL] + - path to output model after training finishes, if not specified, will output like 0003.model where 0003 is number of rounds to do boosting. +* model_dir [default=models] + - The output directory of the saved models during training +* fmap + - feature map, used for dump model +* name_dump [default=dump.txt] + - name of model dump file +* name_pred [default=pred.txt] + - name of prediction file, used in pred mode +* pred_margin [default=0] + - predict margin instead of transformed probability diff --git a/doc/python.md b/doc/python.md new file mode 100644 index 000000000..233a6f797 --- /dev/null +++ b/doc/python.md @@ -0,0 +1,126 @@ +XGBoost Python Module +==== + +This page will introduce XGBoost Python module, including: +* [Building and Import](#building-and-import) +* [Data Interface](#data-interface) +* [Setting Parameters](#setting-parameters) +* [Train Model](#training-model) +* [Early Stopping](#early-stopping) +* [Prediction](#prediction) + +A [walk through python example](https://github.com/tqchen/xgboost/blob/master/demo/guide-python) for UCI Mushroom dataset is provided. + += +#### Install + +To install XGBoost, you need to run `make` in the root directory of the project and then in the `wrappers` directory run + +```shell +python setup.py install +``` +Then import the module in Python as usual +```python +import xgboost as xgb +``` + += +#### Data Interface +XGBoost python module is able to loading from libsvm txt format file, Numpy 2D array and xgboost binary buffer file. The data will be store in ```DMatrix``` object. 
diff --git a/doc/python.md b/doc/python.md
new file mode 100644
index 000000000..233a6f797
--- /dev/null
+++ b/doc/python.md
@@ -0,0 +1,126 @@
+XGBoost Python Module
+====
+
+This page introduces the XGBoost Python module, including:
+* [Install](#install)
+* [Data Interface](#data-interface)
+* [Setting Parameters](#setting-parameters)
+* [Training Model](#training-model)
+* [Early Stopping](#early-stopping)
+* [Prediction](#prediction)
+
+A [walk-through Python example](https://github.com/tqchen/xgboost/blob/master/demo/guide-python) using the UCI Mushroom dataset is provided.
+
+=
+#### Install
+
+To install XGBoost, run `make` in the root directory of the project and then, in the `wrappers` directory, run
+
+```shell
+python setup.py install
+```
+Then import the module in Python as usual:
+```python
+import xgboost as xgb
+```
+
+=
+#### Data Interface
+The XGBoost Python module is able to load data from libsvm text format files, NumPy 2D arrays and XGBoost binary buffer files. The data is stored in a ```DMatrix``` object.
+
+* To load a libsvm text format file or an XGBoost binary file into ```DMatrix```:
+```python
+dtrain = xgb.DMatrix('train.svm.txt')
+dtest = xgb.DMatrix('test.svm.buffer')
+```
+* To load a NumPy array into ```DMatrix```:
+```python
+data = np.random.rand(5,10) # 5 entities, each contains 10 features
+label = np.random.randint(2, size=5) # binary target
+dtrain = xgb.DMatrix( data, label=label)
+```
+* To build a ```DMatrix``` from ```scipy.sparse```:
+```python
+# dat, row and col are the value and index arrays describing the sparse matrix
+csr = scipy.sparse.csr_matrix( (dat, (row,col)) )
+dtrain = xgb.DMatrix( csr )
+```
+* Saving a ```DMatrix``` into an XGBoost binary file will make loading faster next time:
+```python
+dtrain = xgb.DMatrix('train.svm.txt')
+dtrain.save_binary("train.buffer")
+```
+* To handle missing values in ```DMatrix```, you can initialize the ```DMatrix``` like:
+```python
+dtrain = xgb.DMatrix( data, label=label, missing = -999.0)
+```
+* Weights can be set when needed:
+```python
+w = np.random.rand(5,1)
+dtrain = xgb.DMatrix( data, label=label, missing = -999.0, weight=w)
+```
+
+=
+#### Setting Parameters
+XGBoost uses a list of pairs to save [parameters](https://github.com/tqchen/xgboost/wiki/Parameters). For example:
+* Booster parameters
+```python
+param = {'bst:max_depth':2, 'bst:eta':1, 'silent':1, 'objective':'binary:logistic' }
+param['nthread'] = 4
+plst = list(param.items())
+plst += [('eval_metric', 'auc')] # Multiple evals can be handled in this way
+plst += [('eval_metric', 'ams@0')]
+```
+* Specify a validation set to watch performance
+```python
+evallist = [(dtest,'eval'), (dtrain,'train')]
+```
+
+=
+#### Training Model
+With the parameter list and data, you are able to train a model.
+* Training
+```python
+num_round = 10
+bst = xgb.train( plst, dtrain, num_round, evallist )
+```
+* Saving model
+After training, you can save the model and dump it out.
+```python
+bst.save_model('0001.model')
+```
+* Dump model and feature map
+You can dump the model to a text file and review the meaning of the model:
+```python
+# dump model
+bst.dump_model('dump.raw.txt')
+# dump model with feature map
+bst.dump_model('dump.raw.txt','featmap.txt')
+```
+* Loading model
+After you save your model, you can load the model file at any time by using
+```python
+bst = xgb.Booster({'nthread':4}) # init model
+bst.load_model("model.bin") # load model
+```
+=
+#### Early stopping
+
+If you have a validation set, you can use early stopping to find the optimal number of boosting rounds. Early stopping requires at least one set in `evals`. If there is more than one, it will use the last.
+
+`train(..., evals=evals, early_stopping_rounds=10)`
+
+The model will train until the validation score stops improving. Validation error needs to decrease at least every `early_stopping_rounds` to continue training.
+
+If early stopping occurs, the model will have two additional fields: `bst.best_score` and `bst.best_iteration`. Note that `train()` will return a model from the last iteration, not the best one.
+
+This works with both metrics to minimize (RMSE, log loss, etc.) and to maximize (MAP, NDCG, AUC).
+
+=
+#### Prediction
+After you have trained or loaded a model and prepared the data, you can start doing prediction.
+```python
+data = np.random.rand(7,10) # 7 entities, each contains 10 features
+dtest = xgb.DMatrix( data, missing = -999.0 )
+ypred = bst.predict( dtest )
+```
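+
+If you need more than the transformed predictions, ```predict``` also accepts a couple of options. A short sketch, assuming the ```output_margin``` and ```pred_leaf``` keyword arguments as used by the boost_from_prediction and predict_leaf_indices demos (they mirror the console parameter pred_margin described in [parameter.md](parameter.md)):
+```python
+# raw margin scores instead of transformed probabilities (assumed kwarg: output_margin)
+ypred_margin = bst.predict( dtest, output_margin=True )
+# index of the leaf each instance falls into, one column per tree (assumed kwarg: pred_leaf)
+leaf_index = bst.predict( dtest, pred_leaf=True )
+```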