Merge branch 'master' into unity

tqchen 2014-09-03 13:52:11 -07:00
commit 46cddb80f4
40 changed files with 384 additions and 237 deletions

View File

@ -1,18 +1,18 @@
Package: xgboost
Type: Package
Title: eXtreme Gradient Boosting
Version: 0.3-0
Version: 0.3-1
Date: 2014-08-23
Author: Tianqi Chen <tianqi.tchen@gmail.com>, Tong He <hetong007@gmail.com>
Maintainer: Tong He <hetong007@gmail.com>
Description: This package is a R wrapper of xgboost, which is short for eXtreme
Gradient Boosting. It is an efficient and scalable implementation of
gradient boosting framework. The package includes efficient linear model
solver and tree learning algorithm. The package can automatically do
solver and tree learning algorithms. The package can automatically do
parallel computation with OpenMP, and it can be more than 10 times faster
than existing gradient boosting packages such as gbm. It supports various
objective functions, including regression, classification and ranking. The
package is made to be extensible, so that user are also allowed to define
package is made to be extensible, so that users are also allowed to define
their own objectives easily.
License: Apache License (== 2.0) | file LICENSE
URL: https://github.com/tqchen/xgboost

View File

@ -52,8 +52,7 @@ This is an introductory document of using the \verb@xgboost@ package in R.
and scalable implementation of gradient boosting framework by \citep{friedman2001greedy}.
The package includes efficient linear model solver and tree learning algorithm.
It supports various objective functions, including regression, classification
and ranking. The package is made to be extendible, so that user are also allowed
to define there own objectives easily. It has several features:
and ranking. The package is made to be extendible, so that users are also allowed to define their own objectives easily. It has several features:
\begin{enumerate}
\item{Speed: }{\verb@xgboost@ can automatically do parallel computation on
Windows and Linux, with openmp. It is generally over 10 times faster than
@ -137,13 +136,10 @@ diris = xgb.DMatrix('iris.xgb.DMatrix')
\section{Advanced Examples}
The function \verb@xgboost@ is a simple function with less parameters, in order
to be R-friendly. The core training function is wrapped in \verb@xgb.train@. It
is more flexible than \verb@xgboost@, but it requires users to read the document
a bit more carefully.
The function \verb@xgboost@ is a simple function with less parameter, in order
to be R-friendly. The core training function is wrapped in \verb@xgb.train@. It is more flexible than \verb@xgboost@, but it requires users to read the document a bit more carefully.
\verb@xgb.train@ only accept a \verb@xgb.DMatrix@ object as its input, while it
supports advanced features as custom objective and evaluation functions.
\verb@xgb.train@ only accept a \verb@xgb.DMatrix@ object as its input, while it supports advanced features as custom objective and evaluation functions.
<<Customized loss function>>=
logregobj <- function(preds, dtrain) {
@ -213,3 +209,4 @@ competition.
\bibliography{xgboost}
\end{document}

View File

@ -8,6 +8,8 @@ Turorial and Documentation: https://github.com/tqchen/xgboost/wiki
Questions and Issues: [https://github.com/tqchen/xgboost/issues](https://github.com/tqchen/xgboost/issues?q=is%3Aissue+label%3Aquestion)
Examples Code: [demo folder](demo)
Notes on the Code: [Code Guide](src)
Features

25
demo/README.md Normal file
View File

@ -0,0 +1,25 @@
XGBoost Examples
====
This folder contains all the example code using xgboost.
Contributions of examples and benchmarks are more than welcome!
If you would like to share how you use xgboost to solve your problem, send a pull request :)
Features Walkthrough
====
This is a list of short examples introducing different functionalities of xgboost and its wrappers.
* Basic walkthrough of wrappers. [python](guide-python/basic_walkthrough.py)
* Customize loss function and evaluation metric. [python](guide-python/custom_objective.py)
* Boosting from existing prediction. [python](guide-python/boost_from_prediction.py)
* Predicting using first n trees. [python](guide-python/predict_first_ntree.py)
* Cross validation (to come; see the sketch after the benchmarks list below)
Basic Examples by Tasks
====
* [Binary classification](binary_classification)
* [Multiclass classification](multiclass_classification)
* [Regression](regression)
* [Learning to Rank](rank)
Benchmarks
====
* [Starter script for Kaggle Higgs Boson](kaggle-higgs)
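
The cross-validation walkthrough above is marked as still to come; the wrapper change later in this commit adds a `cv()` helper to `wrapper/xgboost.py`, so a minimal sketch of what such a demo could look like is given below. The file layout, data path and parameter values are illustrative assumptions borrowed from the other guide-python demos, not part of the commit.

#!/usr/bin/python
# hypothetical sketch of the cross-validation demo listed as "to come" above;
# it uses the cv() helper added to wrapper/xgboost.py in this commit
import sys
sys.path.append('../../wrapper')  # same path convention as the other guide-python demos
import xgboost as xgb
# load the training data, as in basic_walkthrough.py
dtrain = xgb.DMatrix('../data/agaricus.txt.train')
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
num_round = 2
# run 5-fold cross validation; cv() writes the per-round mean+std of each metric to stderr
xgb.cv(param, dtrain, num_round, nfold=5, eval_metric=['error'])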

2
demo/data/README.md Normal file
View File

@ -0,0 +1,2 @@
This folder contains the processed example datasets used by the demos.
Copyright of the datasets belongs to the original copyright holders.

3
demo/guide-R/README.md Normal file
View File

@ -0,0 +1,3 @@
XGBoost R Feature Walkthrough
====
To be finished

5
demo/guide-R/runall.sh Executable file
View File

@ -0,0 +1,5 @@
#!/bin/bash
# todo
Rscript basic_walkthrough.R
Rscript custom_objective.R
Rscript boost_from_prediction.R

View File

@ -0,0 +1,6 @@
XGBoost Python Feature Walkthrough
====
* [Basic walkthrough of wrappers](basic_walkthrough.py)
* [Customize loss function and evaluation metric](custom_objective.py)
* [Boosting from existing prediction](boost_from_prediction.py)
* [Predicting using first n trees](predict_first_ntree.py)

View File

@ -0,0 +1,70 @@
#!/usr/bin/python
import sys
import numpy as np
import scipy.sparse
# append the path to xgboost, you may need to change the following line
# alternatively, you can add the path to PYTHONPATH environment variable
sys.path.append('../../wrapper')
import xgboost as xgb
### simple example
# load file from text file, also binary buffer generated by xgboost
dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')
# specify parameters via map; definitions are the same as in the C++ version
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic' }
# specify validations set to watch performance
watchlist = [(dtest,'eval'), (dtrain,'train')]
num_round = 2
bst = xgb.train(param, dtrain, num_round, watchlist)
# this is prediction
preds = bst.predict(dtest)
labels = dtest.get_label()
print ('error=%f' % ( sum(1 for i in range(len(preds)) if int(preds[i]>0.5)!=labels[i]) /float(len(preds))))
bst.save_model('0001.model')
# dump model
bst.dump_model('dump.raw.txt')
# dump model with feature map
bst.dump_model('dump.nice.txt','../data/featmap.txt')
# save dmatrix into binary buffer
dtest.save_binary('dtest.buffer')
bst.save_model('xgb.model')
# load model and data in
bst2 = xgb.Booster(model_file='xgb.model')
dtest2 = xgb.DMatrix('dtest.buffer')
preds2 = bst2.predict(dtest2)
# assert they are the same
assert np.sum(np.abs(preds2-preds)) == 0
###
# build dmatrix from scipy.sparse
print ('start running example of build DMatrix from scipy.sparse')
labels = []
row = []; col = []; dat = []
i = 0
for l in open('../data/agaricus.txt.train'):
arr = l.split()
labels.append( int(arr[0]))
for it in arr[1:]:
k,v = it.split(':')
row.append(i); col.append(int(k)); dat.append(float(v))
i += 1
csr = scipy.sparse.csr_matrix( (dat, (row,col)) )
dtrain = xgb.DMatrix( csr )
dtrain.set_label(labels)
watchlist = [(dtest,'eval'), (dtrain,'train')]
bst = xgb.train( param, dtrain, num_round, watchlist )
print ('start running example of build DMatrix from numpy array')
# NOTE: npymat is a numpy array; internally it is converted into a scipy.sparse.csr_matrix and then into a DMatrix
npymat = csr.todense()
dtrain = xgb.DMatrix( npymat)
dtrain.set_label(labels)
watchlist = [(dtest,'eval'), (dtrain,'train')]
bst = xgb.train( param, dtrain, num_round, watchlist )

View File

@ -0,0 +1,26 @@
#!/usr/bin/python
import sys
import numpy as np
sys.path.append('../../wrapper')
import xgboost as xgb
dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')
watchlist = [(dtest,'eval'), (dtrain,'train')]
###
# advanced: start from an initial base prediction
#
print ('start running example to start from an initial prediction')
# specify parameters via map; definitions are the same as in the C++ version
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic' }
# train xgboost for 1 round
bst = xgb.train( param, dtrain, 1, watchlist )
# Note: we need the margin value instead of transformed prediction in set_base_margin
# do predict with output_margin=True, will always give you margin values before logistic transformation
ptrain = bst.predict(dtrain, output_margin=True)
ptest = bst.predict(dtest, output_margin=True)
dtrain.set_base_margin(ptrain)
dtest.set_base_margin(ptest)
print ('this is result of running from initial prediction')
bst = xgb.train( param, dtrain, 1, watchlist )

View File

@ -0,0 +1,44 @@
#!/usr/bin/python
import sys
import numpy as np
sys.path.append('../../wrapper')
import xgboost as xgb
###
# advanced: customized loss function
#
print ('start running example of using a customized objective function')
dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')
# note: for customized objective function, we leave objective as default
# note: what we are getting is margin value in prediction
# you must know what you are doing
param = {'max_depth':2, 'eta':1, 'silent':1 }
watchlist = [(dtest,'eval'), (dtrain,'train')]
num_round = 2
# user-defined objective function: given predictions, return the gradient and second-order gradient
# this is the log-likelihood loss of logistic regression
def logregobj(preds, dtrain):
labels = dtrain.get_label()
preds = 1.0 / (1.0 + np.exp(-preds))
grad = preds - labels
hess = preds * (1.0-preds)
return grad, hess
# user-defined evaluation function, returning a pair (metric_name, result)
# NOTE: with a customized loss function, the default prediction value is the margin
# this may make the built-in evaluation metrics not function properly
# for example, with logistic loss the prediction is the score before the logistic transformation
# while the built-in evaluation error assumes the input is after the logistic transformation
# keep this in mind when you use the customization; you may need to write a customized evaluation function
def evalerror(preds, dtrain):
labels = dtrain.get_label()
# return a pair metric_name, result
# since preds are margin(before logistic transformation, cutoff at 0)
return 'error', float(sum(labels != (preds > 0.0))) / len(labels)
# training with customized objective, we can also do step by step training
# simply look at xgboost.py's implementation of train
bst = xgb.train(param, dtrain, num_round, watchlist, logregobj, evalerror)

View File

@ -0,0 +1,22 @@
#!/usr/bin/python
import sys
import numpy as np
sys.path.append('../../wrapper')
import xgboost as xgb
### load data and do training
dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic' }
watchlist = [(dtest,'eval'), (dtrain,'train')]
num_round = 3
bst = xgb.train(param, dtrain, num_round, watchlist)
print ('start testing prediction from first n trees')
### predict using first 1 tree
label = dtest.get_label()
ypred1 = bst.predict(dtest, ntree_limit=1)
# by default, we predict using all the trees
ypred2 = bst.predict(dtest)
print ('error of ypred1=%f' % (np.sum((ypred1>0.5)!=label) /float(len(label))))
print ('error of ypred2=%f' % (np.sum((ypred2>0.5)!=label) /float(len(label))))

5
demo/guide-python/runall.sh Executable file
View File

@ -0,0 +1,5 @@
#!/bin/bash
python basic_walkthrough.py
python custom_objective.py
python boost_from_prediction.py
rm *~ *.model *.buffer

View File

@ -24,6 +24,7 @@ class GBLinear : public IGradBooster {
}
// set model parameters
virtual void SetParam(const char *name, const char *val) {
using namespace std;
if (!strncmp(name, "bst:", 4)) {
param.SetParam(name + 4, val);
}
@ -166,6 +167,7 @@ class GBLinear : public IGradBooster {
learning_rate = 1.0f;
}
inline void SetParam(const char *name, const char *val) {
using namespace std;
// sync-names
if (!strcmp("eta", name)) learning_rate = static_cast<float>(atof(val));
if (!strcmp("lambda", name)) reg_lambda = static_cast<float>(atof(val));
@ -207,9 +209,10 @@ class GBLinear : public IGradBooster {
Param(void) {
num_feature = 0;
num_output_group = 1;
memset(reserved, 0, sizeof(reserved));
std::memset(reserved, 0, sizeof(reserved));
}
inline void SetParam(const char *name, const char *val) {
using namespace std;
if (!strcmp(name, "bst:num_feature")) num_feature = atoi(val);
if (!strcmp(name, "num_output_group")) num_output_group = atoi(val);
}

View File

@ -1,7 +1,6 @@
#define _CRT_SECURE_NO_WARNINGS
#define _CRT_SECURE_NO_DEPRECATE
#include <cstring>
using namespace std;
#include "./gbm.h"
#include "./gbtree-inl.hpp"
#include "./gblinear-inl.hpp"
@ -9,6 +8,7 @@ using namespace std;
namespace xgboost {
namespace gbm {
IGradBooster* CreateGradBooster(const char *name) {
using namespace std;
if (!strcmp("gbtree", name)) return new GBTree();
if (!strcmp("gblinear", name)) return new GBLinear();
utils::Error("unknown booster type: %s", name);

View File

@ -23,6 +23,7 @@ class GBTree : public IGradBooster {
this->Clear();
}
virtual void SetParam(const char *name, const char *val) {
using namespace std;
if (!strncmp(name, "bst:", 4)) {
cfg.push_back(std::make_pair(std::string(name+4), std::string(val)));
// set into updaters, if already intialized
@ -171,14 +172,14 @@ class GBTree : public IGradBooster {
updaters.clear();
std::string tval = tparam.updater_seq;
char *pstr;
pstr = strtok(&tval[0], ",");
pstr = std::strtok(&tval[0], ",");
while (pstr != NULL) {
updaters.push_back(tree::CreateUpdater(pstr));
for (size_t j = 0; j < cfg.size(); ++j) {
// set parameters
updaters.back()->SetParam(cfg[j].first.c_str(), cfg[j].second.c_str());
}
pstr = strtok(NULL, ",");
pstr = std::strtok(NULL, ",");
}
tparam.updater_initialized = 1;
}
@ -279,6 +280,7 @@ class GBTree : public IGradBooster {
updater_initialized = 0;
}
inline void SetParam(const char *name, const char *val){
using namespace std;
if (!strcmp(name, "updater") &&
strcmp(updater_seq.c_str(), val) != 0) {
updater_seq = val;
@ -319,7 +321,7 @@ class GBTree : public IGradBooster {
num_pbuffer = 0;
num_output_group = 1;
size_leaf_vector = 0;
memset(reserved, 0, sizeof(reserved));
std::memset(reserved, 0, sizeof(reserved));
}
/*!
* \brief set parameters from outside
@ -327,6 +329,7 @@ class GBTree : public IGradBooster {
* \param val value of the parameter
*/
inline void SetParam(const char *name, const char *val) {
using namespace std;
if (!strcmp("num_pbuffer", name)) num_pbuffer = atol(val);
if (!strcmp("num_output_group", name)) num_output_group = atol(val);
if (!strcmp("bst:num_roots", name)) num_roots = atoi(val);

View File

@ -1,7 +1,6 @@
#define _CRT_SECURE_NO_WARNINGS
#define _CRT_SECURE_NO_DEPRECATE
#include <string>
using namespace std;
#include "./io.h"
#include "../utils/io.h"
#include "../utils/utils.h"

View File

@ -55,8 +55,8 @@ class DMatrixSimple : public DataMatrix {
RowBatch::Inst inst = batch[i];
row_data_.resize(row_data_.size() + inst.length);
if (inst.length != 0) {
memcpy(&row_data_[row_ptr_.back()], inst.data,
sizeof(RowBatch::Entry) * inst.length);
std::memcpy(&row_data_[row_ptr_.back()], inst.data,
sizeof(RowBatch::Entry) * inst.length);
}
row_ptr_.push_back(row_ptr_.back() + inst.length);
}
@ -82,6 +82,7 @@ class DMatrixSimple : public DataMatrix {
* \param silent whether print information or not
*/
inline void LoadText(const char* fname, bool silent = false) {
using namespace std;
this->Clear();
FILE* file = utils::FopenCheck(fname, "r");
float label; bool init = true;
@ -135,7 +136,7 @@ class DMatrixSimple : public DataMatrix {
* \return whether loading is success
*/
inline bool LoadBinary(const char* fname, bool silent = false) {
FILE *fp = fopen64(fname, "rb");
std::FILE *fp = fopen64(fname, "rb");
if (fp == NULL) return false;
utils::FileStream fs(fp);
this->LoadBinary(fs, silent, fname);
@ -208,6 +209,7 @@ class DMatrixSimple : public DataMatrix {
* \param savebuffer whether do save binary buffer if it is text
*/
inline void CacheLoad(const char *fname, bool silent = false, bool savebuffer = true) {
using namespace std;
size_t len = strlen(fname);
if (len > 8 && !strcmp(fname + len - 7, ".buffer")) {
if (!this->LoadBinary(fname, silent)) {
@ -216,7 +218,7 @@ class DMatrixSimple : public DataMatrix {
return;
}
char bname[1024];
snprintf(bname, sizeof(bname), "%s.buffer", fname);
utils::SPrintf(bname, sizeof(bname), "%s.buffer", fname);
if (!this->LoadBinary(bname, silent)) {
this->LoadText(fname, silent);
if (savebuffer) this->SaveBinary(bname, silent);

View File

@ -90,6 +90,7 @@ struct MetaInfo {
}
// try to load group information from file, if exists
inline bool TryLoadGroup(const char* fname, bool silent = false) {
using namespace std;
FILE *fi = fopen64(fname, "r");
if (fi == NULL) return false;
group_ptr.push_back(0);
@ -105,6 +106,7 @@ struct MetaInfo {
return true;
}
inline std::vector<float>& GetFloatInfo(const char *field) {
using namespace std;
if (!strcmp(field, "label")) return labels;
if (!strcmp(field, "weight")) return weights;
if (!strcmp(field, "base_margin")) return base_margin;
@ -115,6 +117,7 @@ struct MetaInfo {
return ((MetaInfo*)this)->GetFloatInfo(field);
}
inline std::vector<unsigned> &GetUIntInfo(const char *field) {
using namespace std;
if (!strcmp(field, "root_index")) return info.root_index;
if (!strcmp(field, "fold_index")) return info.fold_index;
utils::Error("unknown field %s", field);
@ -125,6 +128,7 @@ struct MetaInfo {
}
// try to load weight information from file, if exists
inline bool TryLoadFloatInfo(const char *field, const char* fname, bool silent = false) {
using namespace std;
std::vector<float> &data = this->GetFloatInfo(field);
FILE *fi = fopen64(fname, "r");
if (fi == NULL) return false;

View File

@ -147,10 +147,11 @@ struct EvalAMS : public IEvaluator {
explicit EvalAMS(const char *name) {
name_ = name;
// note: ams@0 will automatically select which ratio to go
utils::Check(sscanf(name, "ams@%f", &ratio_) == 1, "invalid ams format");
utils::Check(std::sscanf(name, "ams@%f", &ratio_) == 1, "invalid ams format");
}
virtual float Eval(const std::vector<float> &preds,
const MetaInfo &info) const {
using namespace std;
const bst_omp_uint ndata = static_cast<bst_omp_uint>(info.labels.size());
utils::Check(info.weights.size() == ndata, "we need weight to evaluate ams");
@ -202,6 +203,7 @@ struct EvalAMS : public IEvaluator {
struct EvalPrecisionRatio : public IEvaluator{
public:
explicit EvalPrecisionRatio(const char *name) : name_(name) {
using namespace std;
if (sscanf(name, "apratio@%f", &ratio_) == 1) {
use_ap = 1;
} else {
@ -342,6 +344,7 @@ struct EvalRankList : public IEvaluator {
protected:
explicit EvalRankList(const char *name) {
using namespace std;
name_ = name;
minus_ = false;
if (sscanf(name, "%*[^@]@%u[-]?", &topn_) != 1) {
@ -388,7 +391,7 @@ struct EvalNDCG : public EvalRankList{
for (size_t i = 0; i < rec.size() && i < this->topn_; ++i) {
const unsigned rel = rec[i].second;
if (rel != 0) {
sumdcg += ((1 << rel) - 1) / log(i + 2.0);
sumdcg += ((1 << rel) - 1) / std::log(i + 2.0);
}
}
return static_cast<float>(sumdcg);

View File

@ -36,6 +36,7 @@ struct IEvaluator{
namespace xgboost {
namespace learner {
inline IEvaluator* CreateEvaluator(const char *name) {
using namespace std;
if (!strcmp(name, "rmse")) return new EvalRMSE();
if (!strcmp(name, "error")) return new EvalError();
if (!strcmp(name, "merror")) return new EvalMatchError();
@ -56,6 +57,7 @@ inline IEvaluator* CreateEvaluator(const char *name) {
class EvalSet{
public:
inline void AddEval(const char *name) {
using namespace std;
for (size_t i = 0; i < evals_.size(); ++i) {
if (!strcmp(name, evals_[i]->Name())) return;
}

View File

@ -79,6 +79,7 @@ class BoostLearner {
* \param val value of the parameter
*/
inline void SetParam(const char *name, const char *val) {
using namespace std;
// in this version, bst: prefix is no longer required
if (strncmp(name, "bst:", 4) != 0) {
std::string n = "bst:"; n += name;
@ -290,7 +291,7 @@ class BoostLearner {
base_score = 0.5f;
num_feature = 0;
num_class = 0;
memset(reserved, 0, sizeof(reserved));
std::memset(reserved, 0, sizeof(reserved));
}
/*!
* \brief set parameters from outside
@ -298,6 +299,7 @@ class BoostLearner {
* \param val value of the parameter
*/
inline void SetParam(const char *name, const char *val) {
using namespace std;
if (!strcmp("base_score", name)) base_score = static_cast<float>(atof(val));
if (!strcmp("num_class", name)) num_class = atoi(val);
if (!strcmp("bst:num_feature", name)) num_feature = atoi(val);

View File

@ -101,6 +101,7 @@ class RegLossObj : public IObjFunction{
}
virtual ~RegLossObj(void) {}
virtual void SetParam(const char *name, const char *val) {
using namespace std;
if (!strcmp("scale_pos_weight", name)) {
scale_pos_weight = static_cast<float>(atof(val));
}
@ -156,6 +157,7 @@ class SoftmaxMultiClassObj : public IObjFunction {
}
virtual ~SoftmaxMultiClassObj(void) {}
virtual void SetParam(const char *name, const char *val) {
using namespace std;
if (!strcmp( "num_class", name )) nclass = atoi(val);
}
virtual void GetGradient(const std::vector<float> &preds,
@ -247,6 +249,7 @@ class LambdaRankObj : public IObjFunction {
}
virtual ~LambdaRankObj(void) {}
virtual void SetParam(const char *name, const char *val) {
using namespace std;
if (!strcmp( "loss_type", name )) loss.loss_type = atoi(val);
if (!strcmp( "fix_list_weight", name)) fix_list_weight = static_cast<float>(atof(val));
if (!strcmp( "num_pairsample", name)) num_pairsample = atoi(val);

View File

@ -67,6 +67,7 @@ namespace xgboost {
namespace learner {
/*! \brief factory funciton to create objective function by name */
inline IObjFunction* CreateObjFunction(const char *name) {
using namespace std;
if (!strcmp("reg:linear", name)) return new RegLossObj(LossType::kLinearSquare);
if (!strcmp("reg:logistic", name)) return new RegLossObj(LossType::kLogisticNeglik);
if (!strcmp("binary:logistic", name)) return new RegLossObj(LossType::kLogisticClassify);

View File

@ -53,7 +53,7 @@ class TreeModel {
Param(void) {
max_depth = 0;
size_leaf_vector = 0;
memset(reserved, 0, sizeof(reserved));
std::memset(reserved, 0, sizeof(reserved));
}
/*!
* \brief set parameters from outside
@ -61,6 +61,7 @@ class TreeModel {
* \param val value of the parameter
*/
inline void SetParam(const char *name, const char *val) {
using namespace std;
if (!strcmp("num_roots", name)) num_roots = atoi(val);
if (!strcmp("num_feature", name)) num_feature = atoi(val);
if (!strcmp("size_leaf_vector", name)) size_leaf_vector = atoi(val);

View File

@ -65,6 +65,7 @@ struct TrainParam{
* \param val value of the parameter
*/
inline void SetParam(const char *name, const char *val) {
using namespace std;
// sync-names
if (!strcmp(name, "gamma")) min_split_loss = static_cast<float>(atof(val));
if (!strcmp(name, "eta")) learning_rate = static_cast<float>(atof(val));

View File

@ -1,7 +1,6 @@
#define _CRT_SECURE_NO_WARNINGS
#define _CRT_SECURE_NO_DEPRECATE
#include <cstring>
using namespace std;
#include "./updater.h"
#include "./updater_prune-inl.hpp"
#include "./updater_refresh-inl.hpp"
@ -10,6 +9,7 @@ using namespace std;
namespace xgboost {
namespace tree {
IUpdater* CreateUpdater(const char *name) {
using namespace std;
if (!strcmp(name, "prune")) return new TreePruner();
if (!strcmp(name, "refresh")) return new TreeRefresher<GradStats>();
if (!strcmp(name, "grow_colmaker")) return new ColMaker<GradStats>();

View File

@ -85,18 +85,18 @@ class ColMaker: public IUpdater {
const BoosterInfo &info,
RegTree *p_tree) {
this->InitData(gpair, *p_fmat, info.root_index, *p_tree);
this->InitNewNode(qexpand, gpair, *p_fmat, info, *p_tree);
this->InitNewNode(qexpand_, gpair, *p_fmat, info, *p_tree);
for (int depth = 0; depth < param.max_depth; ++depth) {
this->FindSplit(depth, this->qexpand, gpair, p_fmat, info, p_tree);
this->ResetPosition(this->qexpand, p_fmat, *p_tree);
this->UpdateQueueExpand(*p_tree, &this->qexpand);
this->InitNewNode(qexpand, gpair, *p_fmat, info, *p_tree);
this->FindSplit(depth, qexpand_, gpair, p_fmat, info, p_tree);
this->ResetPosition(qexpand_, p_fmat, *p_tree);
this->UpdateQueueExpand(*p_tree, &qexpand_);
this->InitNewNode(qexpand_, gpair, *p_fmat, info, *p_tree);
// if nothing left to be expand, break
if (qexpand.size() == 0) break;
if (qexpand_.size() == 0) break;
}
// set all the rest expanding nodes to leaf
for (size_t i = 0; i < qexpand.size(); ++i) {
const int nid = qexpand[i];
for (size_t i = 0; i < qexpand_.size(); ++i) {
const int nid = qexpand_[i];
(*p_tree)[nid].set_leaf(snode[nid].weight * param.learning_rate);
}
// remember auxiliary statistics in the tree node
@ -169,9 +169,9 @@ class ColMaker: public IUpdater {
snode.reserve(256);
}
{// expand query
qexpand.reserve(256); qexpand.clear();
qexpand_.reserve(256); qexpand_.clear();
for (int i = 0; i < tree.param.num_roots; ++i) {
qexpand.push_back(i);
qexpand_.push_back(i);
}
}
}
@ -233,6 +233,7 @@ class ColMaker: public IUpdater {
const BoosterInfo &info) {
bool need_forward = param.need_forward_search(fmat.GetColDensity(fid));
bool need_backward = param.need_backward_search(fmat.GetColDensity(fid));
const std::vector<int> &qexpand = qexpand_;
int nthread;
#pragma omp parallel
{
@ -362,6 +363,7 @@ class ColMaker: public IUpdater {
const std::vector<bst_gpair> &gpair,
const BoosterInfo &info,
std::vector<ThreadEntry> &temp) {
const std::vector<int> &qexpand = qexpand_;
// clear all the temp statistics
for (size_t j = 0; j < qexpand.size(); ++j) {
temp[qexpand[j]].stats.Clear();
@ -382,7 +384,7 @@ class ColMaker: public IUpdater {
e.last_fvalue = fvalue;
} else {
// try to find a split
if (fabsf(fvalue - e.last_fvalue) > rt_2eps && e.stats.sum_hess >= param.min_child_weight) {
if (std::abs(fvalue - e.last_fvalue) > rt_2eps && e.stats.sum_hess >= param.min_child_weight) {
c.SetSubstract(snode[nid].stats, e.stats);
if (c.sum_hess >= param.min_child_weight) {
bst_float loss_chg = static_cast<bst_float>(e.stats.CalcGain(param) + c.CalcGain(param) - snode[nid].root_gain);
@ -539,7 +541,7 @@ class ColMaker: public IUpdater {
/*! \brief TreeNode Data: statistics for each constructed node */
std::vector<NodeEntry> snode;
/*! \brief queue of nodes to be expanded */
std::vector<int> qexpand;
std::vector<int> qexpand_;
};
};

View File

@ -17,6 +17,7 @@ class TreePruner: public IUpdater {
virtual ~TreePruner(void) {}
// set training parameter
virtual void SetParam(const char *name, const char *val) {
using namespace std;
param.SetParam(name, val);
if (!strcmp(name, "silent")) silent = atoi(val);
}

View File

@ -24,15 +24,15 @@ class FeatMap {
// function definitions
/*! \brief load feature map from text format */
inline void LoadText(const char *fname) {
FILE *fi = utils::FopenCheck(fname, "r");
std::FILE *fi = utils::FopenCheck(fname, "r");
this->LoadText(fi);
fclose(fi);
std::fclose(fi);
}
/*! \brief load feature map from text format */
inline void LoadText(FILE *fi) {
inline void LoadText(std::FILE *fi) {
int fid;
char fname[1256], ftype[1256];
while (fscanf(fi, "%d\t%[^\t]\t%s\n", &fid, fname, ftype) == 3) {
while (std::fscanf(fi, "%d\t%[^\t]\t%s\n", &fid, fname, ftype) == 3) {
this->PushBack(fid, fname, ftype);
}
}
@ -62,6 +62,7 @@ class FeatMap {
private:
inline static Type GetType(const char *tname) {
using namespace std;
if (!strcmp("i", tname)) return kIndicator;
if (!strcmp("q", tname)) return kQuantitive;
if (!strcmp("int", tname)) return kInteger;

View File

@ -105,20 +105,20 @@ class FileStream : public ISeekStream {
this->fp = NULL;
}
virtual size_t Read(void *ptr, size_t size) {
return fread(ptr, size, 1, fp);
return std::fread(ptr, size, 1, fp);
}
virtual void Write(const void *ptr, size_t size) {
fwrite(ptr, size, 1, fp);
std::fwrite(ptr, size, 1, fp);
}
virtual void Seek(long pos) {
fseek(fp, pos, SEEK_SET);
std::fseek(fp, pos, SEEK_SET);
}
virtual long Tell(void) {
return ftell(fp);
return std::ftell(fp);
}
inline void Close(void) {
if (fp != NULL){
fclose(fp); fp = NULL;
std::fclose(fp); fp = NULL;
}
}

View File

@ -53,7 +53,7 @@ inline double NextDouble(void) {
}
/*! \brief return a random number in n */
inline uint32_t NextUInt32(uint32_t n) {
return (uint32_t)floor(NextDouble() * n);
return (uint32_t)std::floor(NextDouble() * n);
}
/*! \brief return x~N(mu,sigma^2) */
inline double SampleNormal(double mu, double sigma) {

View File

@ -86,7 +86,7 @@ void HandlePrint(const char *msg);
#endif
#endif
#ifdef XGBOOST_STRICT_CXX98_
// these function pointers are to be assigned
// these function pointers are to be assigned
extern "C" void (*Printf)(const char *fmt, ...);
extern "C" int (*SPrintf)(char *buf, size_t size, const char *fmt, ...);
extern "C" void (*Assert)(int exp, const char *fmt, ...);
@ -94,7 +94,7 @@ extern "C" void (*Check)(int exp, const char *fmt, ...);
extern "C" void (*Error)(const char *fmt, ...);
#else
/*! \brief printf, print message to the console */
inline void Printf(const char *fmt, ...) {
inline void Printf(const char *fmt, ...) {
std::string msg(kPrintBuffer, '\0');
va_list args;
va_start(args, fmt);
@ -103,7 +103,7 @@ inline void Printf(const char *fmt, ...) {
HandlePrint(msg.c_str());
}
/*! \brief portable version of snprintf */
inline int SPrintf(char *buf, size_t size, const char *fmt, ...) {
inline int SPrintf(char *buf, size_t size, const char *fmt, ...) {
va_list args;
va_start(args, fmt);
int ret = vsnprintf(buf, size, fmt, args);
@ -149,12 +149,12 @@ inline void Error(const char *fmt, ...) {
#endif
/*! \brief replace fopen, report error when the file open fails */
inline FILE *FopenCheck(const char *fname, const char *flag) {
FILE *fp = fopen64(fname, flag);
inline std::FILE *FopenCheck(const char *fname, const char *flag) {
std::FILE *fp = fopen64(fname, flag);
Check(fp != NULL, "can not open file \"%s\"\n", fname);
return fp;
}
} // namespace utils
} // namespace utils
// easy utils that can be directly acessed in xgboost
/*! \brief get the beginning address of a vector */
template<typename T>

View File

@ -2,11 +2,10 @@ Wrapper of XGBoost
=====
This folder provides wrapper of xgboost to other languages
Python
=====
* To make the python module, type ```make``` in the root directory of project
* Refer to the walk through example in [python-example/demo.py](python-example/demo.py)
* Refer also to the walk through example in [demo folder](../demo/guide-python)
R
=====

View File

@ -1,3 +0,0 @@
example to use python xgboost, the data is generated from demo/binary_classification, in libsvm format
for usage: see demo.py and comments in demo.py

View File

@ -1,121 +0,0 @@
#!/usr/bin/python
import sys
import numpy as np
import scipy.sparse
# append the path to xgboost, you may need to change the following line
# alternatively, you can add the path to PYTHONPATH environment variable
sys.path.append('../')
import xgboost as xgb
### simple example
# load file from text file, also binary buffer generated by xgboost
dtrain = xgb.DMatrix('agaricus.txt.train')
dtest = xgb.DMatrix('agaricus.txt.test')
# specify parameters via map, definition are same as c++ version
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic' }
# specify validations set to watch performance
evallist = [(dtest,'eval'), (dtrain,'train')]
num_round = 2
bst = xgb.train(param, dtrain, num_round, evallist)
# this is prediction
preds = bst.predict(dtest)
labels = dtest.get_label()
print ('error=%f' % ( sum(1 for i in range(len(preds)) if int(preds[i]>0.5)!=labels[i]) /float(len(preds))))
bst.save_model('0001.model')
# dump model
bst.dump_model('dump.raw.txt')
# dump model with feature map
bst.dump_model('dump.nice.txt','featmap.txt')
# save dmatrix into binary buffer
dtest.save_binary('dtest.buffer')
bst.save_model('xgb.model')
# load model and data in
bst2 = xgb.Booster(model_file='xgb.model')
dtest2 = xgb.DMatrix('dtest.buffer')
preds2 = bst2.predict(dtest2)
# assert they are the same
assert np.sum(np.abs(preds2-preds)) == 0
###
# build dmatrix from scipy.sparse
print ('start running example of build DMatrix from scipy.sparse')
labels = []
row = []; col = []; dat = []
i = 0
for l in open('agaricus.txt.train'):
arr = l.split()
labels.append( int(arr[0]))
for it in arr[1:]:
k,v = it.split(':')
row.append(i); col.append(int(k)); dat.append(float(v))
i += 1
csr = scipy.sparse.csr_matrix( (dat, (row,col)) )
dtrain = xgb.DMatrix( csr )
dtrain.set_label(labels)
evallist = [(dtest,'eval'), (dtrain,'train')]
bst = xgb.train( param, dtrain, num_round, evallist )
print ('start running example of build DMatrix from numpy array')
# NOTE: npymat is numpy array, we will convert it into scipy.sparse.csr_matrix in internal implementation,then convert to DMatrix
npymat = csr.todense()
dtrain = xgb.DMatrix( npymat)
dtrain.set_label(labels)
evallist = [(dtest,'eval'), (dtrain,'train')]
bst = xgb.train( param, dtrain, num_round, evallist )
###
# advanced: cutomsized loss function
#
print ('start running example to used cutomized objective function')
# note: for customized objective function, we leave objective as default
# note: what we are getting is margin value in prediction
# you must know what you are doing
param = {'max_depth':2, 'eta':1, 'silent':1 }
# user define objective function, given prediction, return gradient and second order gradient
# this is loglikelihood loss
def logregobj(preds, dtrain):
labels = dtrain.get_label()
preds = 1.0 / (1.0 + np.exp(-preds))
grad = preds - labels
hess = preds * (1.0-preds)
return grad, hess
# user defined evaluation function, return a pair metric_name, result
# NOTE: when you do customized loss function, the default prediction value is margin
# this may make buildin evalution metric not function properly
# for example, we are doing logistic loss, the prediction is score before logistic transformation
# the buildin evaluation error assumes input is after logistic transformation
# Take this in mind when you use the customization, and maybe you need write customized evaluation function
def evalerror(preds, dtrain):
labels = dtrain.get_label()
# return a pair metric_name, result
# since preds are margin(before logistic transformation, cutoff at 0)
return 'error', float(sum(labels != (preds > 0.0))) / len(labels)
# training with customized objective, we can also do step by step training
# simply look at xgboost.py's implementation of train
bst = xgb.train(param, dtrain, num_round, evallist, logregobj, evalerror)
###
# advanced: start from a initial base prediction
#
print ('start running example to start from a initial prediction')
# specify parameters via map, definition are same as c++ version
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic' }
# train xgboost for 1 round
bst = xgb.train( param, dtrain, 1, evallist )
# Note: we need the margin value instead of transformed prediction in set_base_margin
# do predict with output_margin=True, will always give you margin values before logistic transformation
ptrain = bst.predict(dtrain, output_margin=True)
ptest = bst.predict(dtest, output_margin=True)
dtrain.set_base_margin(ptrain)
dtest.set_base_margin(ptest)
print ('this is result of running from initial prediction')
bst = xgb.train( param, dtrain, 1, evallist )

View File

@ -53,7 +53,7 @@ class DMatrix:
missing: float
value in data which need to be present as missing value
weight: list or numpy 1d array, optional
weight for each instances
weight for each instances
"""
# force into void_p, mac need to pass things in as void_p
if data is None:
@ -318,7 +318,7 @@ class Booster:
self.handle, ctypes.c_char_p(k.encode('utf-8')),
ctypes.c_char_p(str(v).encode('utf-8')))
def update(self, dtrain, it):
def update(self, dtrain, it, fobj=None):
"""
update
Args:
@ -326,11 +326,19 @@ class Booster:
the training DMatrix
it: int
current iteration number
fobj: function
customized objective function
Returns:
None
"""
assert isinstance(dtrain, DMatrix)
xglib.XGBoosterUpdateOneIter(self.handle, it, dtrain.handle)
if fobj is None:
xglib.XGBoosterUpdateOneIter(self.handle, it, dtrain.handle)
else:
pred = self.predict( dtrain )
grad, hess = fobj( pred, dtrain )
self.boost( dtrain, grad, hess )
def boost(self, dtrain, grad, hess):
""" update
Args:
@ -347,22 +355,32 @@ class Booster:
(ctypes.c_float*len(grad))(*grad),
(ctypes.c_float*len(hess))(*hess),
len(grad))
def eval_set(self, evals, it = 0):
def eval_set(self, evals, it = 0, feval = None):
"""evaluates by metric
Args:
evals: list of tuple (DMatrix, string)
lists of items to be evaluated
it: int
feval: function
custom evaluation function
Returns:
evals result
"""
for d in evals:
assert isinstance(d[0], DMatrix)
assert isinstance(d[1], str)
dmats = (ctypes.c_void_p * len(evals) )(*[ d[0].handle for d in evals])
evnames = (ctypes.c_char_p * len(evals))(
* [ctypes.c_char_p(d[1].encode('utf-8')) for d in evals])
return xglib.XGBoosterEvalOneIter(self.handle, it, dmats, evnames, len(evals))
if feval is None:
for d in evals:
assert isinstance(d[0], DMatrix)
assert isinstance(d[1], str)
dmats = (ctypes.c_void_p * len(evals) )(*[ d[0].handle for d in evals])
evnames = (ctypes.c_char_p * len(evals))(
* [ctypes.c_char_p(d[1].encode('utf-8')) for d in evals])
return xglib.XGBoosterEvalOneIter(self.handle, it, dmats, evnames, len(evals))
else:
res = '[%d]' % it
for dm, evname in evals:
name, val = feval(self.predict(dm), dm)
res += '\t%s-%s:%f' % (evname, name, val)
return res
def eval(self, mat, name = 'eval', it = 0):
return self.eval_set( [(mat,name)], it)
def predict(self, data, output_margin=False, ntree_limit=0):
@ -373,7 +391,6 @@ class Booster:
the dmatrix storing the input
output_margin: bool
whether output raw margin value that is untransformed
ntree_limit: limit number of trees in prediction, default to 0, 0 means using all the trees
Returns:
numpy array of prediction
@ -447,30 +464,6 @@ class Booster:
fmap[fid]+= 1
return fmap
def evaluate(bst, evals, it, feval = None):
"""evaluation on eval set
Args:
bst: XGBoost object
object of XGBoost model
evals: list of tuple (DMatrix, string)
obj need to be evaluated
it: int
feval: optional
Returns:
eval result
"""
if feval != None:
res = '[%d]' % it
for dm, evname in evals:
name, val = feval(bst.predict(dm), dm)
res += '\t%s-%s:%f' % (evname, name, val)
else:
res = bst.eval_set(evals, it)
return res
def train(params, dtrain, num_boost_round = 10, evals = [], obj=None, feval=None):
""" train a booster with given paramaters
Args:
@ -482,26 +475,69 @@ def train(params, dtrain, num_boost_round = 10, evals = [], obj=None, feval=None
num of round to be boosted
evals: list
list of items to be evaluated
obj:
feval:
obj: function
customized objective function
feval: function
customized evaluation function
"""
bst = Booster(params, [dtrain]+[ d[0] for d in evals ] )
if obj is None:
for i in range(num_boost_round):
bst.update( dtrain, i )
if len(evals) != 0:
sys.stderr.write(evaluate(bst, evals, i, feval).decode()+'\n')
else:
# try customized objective function
for i in range(num_boost_round):
pred = bst.predict( dtrain )
grad, hess = obj( pred, dtrain )
bst.boost( dtrain, grad, hess )
if len(evals) != 0:
sys.stderr.write(evaluate(bst, evals, i, feval)+'\n')
for i in range(num_boost_round):
bst.update( dtrain, i, obj )
if len(evals) != 0:
sys.stderr.write(bst.eval_set(evals, i, feval).decode()+'\n')
return bst
def cv(params, dtrain, num_boost_round = 10, nfold=3, evals = [], obj=None, feval=None):
class CVPack:
def __init__(self, dtrain, dtest, param):
self.dtrain = dtrain
self.dtest = dtest
self.watchlist = watchlist = [ (dtrain,'train'), (dtest, 'test') ]
self.bst = Booster(param, [dtrain,dtest])
def update(self, r, fobj):
self.bst.update(self.dtrain, r, fobj)
def eval(self, r, feval):
return self.bst.eval_set(self.watchlist, r, feval)
def mknfold(dall, nfold, param, seed, evals=[], fpreproc = None):
"""
make an nfold list of CVPack objects from a random index permutation
"""
np.random.seed(seed)
randidx = np.random.permutation(dall.num_row())
kstep = len(randidx) / nfold
idset = [randidx[ (i*kstep) : min(len(randidx),(i+1)*kstep) ] for i in range(nfold)]
ret = []
for k in range(nfold):
dtrain = dall.slice(np.concatenate([idset[i] for i in range(nfold) if k != i]))
dtest = dall.slice(idset[k])
# run preprocessing on the data set if needed
if fpreproc is not None:
dtrain, dtest, tparam = fpreproc(dtrain, dtest, param.copy())
else:
tparam = param
plst = tparam.items() + [('eval_metric', itm) for itm in evals]
ret.append(CVPack(dtrain, dtest, plst))
return ret
def aggcv(rlist):
"""
aggregate cross validation results
"""
cvmap = {}
ret = rlist[0].split()[0]
for line in rlist:
arr = line.split()
assert ret == arr[0]
for it in arr[1:]:
k, v = it.split(':')
if k not in cvmap:
cvmap[k] = []
cvmap[k].append(float(v))
for k, v in sorted(cvmap.items(), key = lambda x:x[0]):
v = np.array(v)
ret += '\t%s:%f+%f' % (k, np.mean(v), np.std(v))
return ret
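# illustrative example (added for this walkthrough, not part of the original file):
# given per-fold result strings such as
#   '[0]\ttrain-error:0.014\ttest-error:0.021'
#   '[0]\ttrain-error:0.016\ttest-error:0.025'
# aggcv checks that the round tag '[0]' matches across folds, groups values by metric name,
# and returns one line such as '[0]\ttest-error:0.023000+0.002000\ttrain-error:0.015000+0.001000',
# i.e. mean+std over the folds, with metric names sorted alphabetically.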
def cv(params, dtrain, num_boost_round = 10, nfold=3, eval_metric = [], \
obj = None, feval = None, fpreproc = None):
""" cross validation with given paramaters
Args:
params: dict
@ -512,15 +548,16 @@ def cv(params, dtrain, num_boost_round = 10, nfold=3, evals = [], obj=None, feva
num of round to be boosted
nfold: int
folds to do cv
evals: list
evals: list or
list of items to be evaluated
obj:
feval:
fpreproc: preprocessing function that takes dtrain, dtest,
param, and returns transformed versions of dtrain, dtest, param
"""
plst = list(params.items())+[('eval_metric', itm) for itm in evals]
cvfolds = mknfold(dtrain, nfold, plst, 0)
cvfolds = mknfold(dtrain, nfold, params, 0, eval_metric, fpreproc)
for i in range(num_boost_round):
for f in cvfolds:
f.update(i)
res = aggcv([f.eval(i) for f in cvfolds])
f.update(i, obj)
res = aggcv([f.eval(i, feval) for f in cvfolds])
sys.stderr.write(res+'\n')
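
Taken together, the new `obj` and `feval` arguments let both `train` and `cv` run with a customized objective and evaluation metric. A minimal sketch follows, reusing the `logregobj` and `evalerror` functions from `demo/guide-python/custom_objective.py` shown earlier in this commit; the data paths and round counts are the same assumptions as in that demo, not additional API guarantees.

import sys
import numpy as np
sys.path.append('../../wrapper')  # assumed layout, as in the guide-python demos
import xgboost as xgb

# gradient and hessian of the logistic loss (as in custom_objective.py)
def logregobj(preds, dtrain):
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))
    return preds - labels, preds * (1.0 - preds)

# error rate computed on margins, so the cutoff is 0 rather than 0.5
def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'error', float(sum(labels != (preds > 0.0))) / len(labels)

dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')
param = {'max_depth': 2, 'eta': 1, 'silent': 1}
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
# train with the customized objective and evaluation function
bst = xgb.train(param, dtrain, 2, watchlist, logregobj, evalerror)
# cv() forwards the same hooks to each fold's update() and eval_set()
xgb.cv(param, dtrain, 2, nfold=3, obj=logregobj, feval=evalerror)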