Merge branch 'master' into unity
commit
46cddb80f4
@ -1,18 +1,18 @@
Package: xgboost
Type: Package
Title: eXtreme Gradient Boosting
Version: 0.3-0
Version: 0.3-1
Date: 2014-08-23
Author: Tianqi Chen <tianqi.tchen@gmail.com>, Tong He <hetong007@gmail.com>
Maintainer: Tong He <hetong007@gmail.com>
Description: This package is a R wrapper of xgboost, which is short for eXtreme
Gradient Boosting. It is an efficient and scalable implementation of
gradient boosting framework. The package includes efficient linear model
solver and tree learning algorithm. The package can automatically do
solver and tree learning algorithms. The package can automatically do
parallel computation with OpenMP, and it can be more than 10 times faster
than existing gradient boosting packages such as gbm. It supports various
objective functions, including regression, classification and ranking. The
package is made to be extensible, so that user are also allowed to define
package is made to be extensible, so that users are also allowed to define
their own objectives easily.
License: Apache License (== 2.0) | file LICENSE
URL: https://github.com/tqchen/xgboost

@ -52,8 +1,7 @@ This is an introductory document of using the \verb@xgboost@ package in R.
and scalable implementation of gradient boosting framework by \citep{friedman2001greedy}.
The package includes efficient linear model solver and tree learning algorithm.
It supports various objective functions, including regression, classification
and ranking. The package is made to be extendible, so that user are also allowed
to define there own objectives easily. It has several features:
and ranking. The package is made to be extendible, so that users are also allowed to define their own objectives easily. It has several features:
\begin{enumerate}
\item{Speed: }{\verb@xgboost@ can automatically do parallel computation on
Windows and Linux, with openmp. It is generally over 10 times faster than
@ -137,13 +136,10 @@ diris = xgb.DMatrix('iris.xgb.DMatrix')

\section{Advanced Examples}

The function \verb@xgboost@ is a simple function with less parameters, in order
to be R-friendly. The core training function is wrapped in \verb@xgb.train@. It
is more flexible than \verb@xgboost@, but it requires users to read the document
a bit more carefully.
The function \verb@xgboost@ is a simple function with less parameter, in order
to be R-friendly. The core training function is wrapped in \verb@xgb.train@. It is more flexible than \verb@xgboost@, but it requires users to read the document a bit more carefully.

\verb@xgb.train@ only accept a \verb@xgb.DMatrix@ object as its input, while it
supports advanced features as custom objective and evaluation functions.
\verb@xgb.train@ only accept a \verb@xgb.DMatrix@ object as its input, while it supports advanced features as custom objective and evaluation functions.

<<Customized loss function>>=
logregobj <- function(preds, dtrain) {
@ -213,3 +209,4 @@ competition.
\bibliography{xgboost}

\end{document}

@ -8,6 +8,8 @@ Tutorial and Documentation: https://github.com/tqchen/xgboost/wiki

Questions and Issues: [https://github.com/tqchen/xgboost/issues](https://github.com/tqchen/xgboost/issues?q=is%3Aissue+label%3Aquestion)

Examples Code: [demo folder](demo)

Notes on the Code: [Code Guide](src)

Features
25 demo/README.md Normal file
@ -0,0 +1,25 @@
XGBoost Examples
====
This folder contains all the example code using xgboost.
Contributions of examples and benchmarks are more than welcome!
If you would like to share how you use xgboost to solve your problem, send a pull request :)

Features Walkthrough
====
This is a list of short code examples introducing the different functionalities of xgboost and its wrappers.
* Basic walkthrough of wrappers. [python](guide-python/basic_walkthrough.py)
* Customize loss function, and evaluation metric. [python](guide-python/custom_objective.py)
* Boosting from existing prediction. [python](guide-python/boost_from_prediction.py)
* Predicting using first n trees. [python](guide-python/predict_first_ntree.py)
* Cross validation (to come)

Basic Examples by Tasks
====
* [Binary classification](binary_classification)
* [Multiclass classification](multiclass_classification)
* [Regression](regression)
* [Learning to Rank](rank)

Benchmarks
====
* [Starter script for Kaggle Higgs Boson](kaggle-higgs)
2 demo/data/README.md Normal file
@ -0,0 +1,2 @@
This folder contains the processed example datasets used by the demos.
Copyright of the dataset belongs to the original copyright holder.
3 demo/guide-R/README.md Normal file
@ -0,0 +1,3 @@
XGBoost R Feature Walkthrough
====
To be finished
5 demo/guide-R/runall.sh Executable file
@ -0,0 +1,5 @@
#!/bin/bash
# todo
Rscript basic_walkthrough.R
Rscript custom_objective.R
Rscript boost_from_prediction.R
6 demo/guide-python/README.md Normal file
@ -0,0 +1,6 @@
XGBoost Python Feature Walkthrough
====
* [Basic walkthrough of wrappers](basic_walkthrough.py)
* [Customize loss function, and evaluation metric](custom_objective.py)
* [Boosting from existing prediction](boost_from_prediction.py)
* [Predicting using first n trees](predict_first_ntree.py)
70 demo/guide-python/basic_walkthrough.py Executable file
@ -0,0 +1,70 @@
#!/usr/bin/python
|
||||
import sys
|
||||
import numpy as np
|
||||
import scipy.sparse
|
||||
# append the path to xgboost, you may need to change the following line
|
||||
# alternatively, you can add the path to PYTHONPATH environment variable
|
||||
sys.path.append('../../wrapper')
|
||||
import xgboost as xgb
|
||||
|
||||
### simple example
|
||||
# load file from text file, also binary buffer generated by xgboost
|
||||
dtrain = xgb.DMatrix('../data/agaricus.txt.train')
|
||||
dtest = xgb.DMatrix('../data/agaricus.txt.test')
|
||||
|
||||
# specify parameters via map, definition are same as c++ version
|
||||
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic' }
|
||||
|
||||
# specify validations set to watch performance
|
||||
watchlist = [(dtest,'eval'), (dtrain,'train')]
|
||||
num_round = 2
|
||||
bst = xgb.train(param, dtrain, num_round, watchlist)
|
||||
|
||||
# this is prediction
|
||||
preds = bst.predict(dtest)
|
||||
labels = dtest.get_label()
|
||||
print ('error=%f' % ( sum(1 for i in range(len(preds)) if int(preds[i]>0.5)!=labels[i]) /float(len(preds))))
|
||||
bst.save_model('0001.model')
|
||||
# dump model
|
||||
bst.dump_model('dump.raw.txt')
|
||||
# dump model with feature map
|
||||
bst.dump_model('dump.nice.txt','../data/featmap.txt')
|
||||
|
||||
# save dmatrix into binary buffer
|
||||
dtest.save_binary('dtest.buffer')
|
||||
bst.save_model('xgb.model')
|
||||
# load model and data in
|
||||
bst2 = xgb.Booster(model_file='xgb.model')
|
||||
dtest2 = xgb.DMatrix('dtest.buffer')
|
||||
preds2 = bst2.predict(dtest2)
|
||||
# assert they are the same
|
||||
assert np.sum(np.abs(preds2-preds)) == 0
|
||||
|
||||
###
|
||||
# build dmatrix from scipy.sparse
|
||||
print ('start running example of build DMatrix from scipy.sparse')
|
||||
labels = []
|
||||
row = []; col = []; dat = []
|
||||
i = 0
|
||||
for l in open('../data/agaricus.txt.train'):
|
||||
arr = l.split()
|
||||
labels.append( int(arr[0]))
|
||||
for it in arr[1:]:
|
||||
k,v = it.split(':')
|
||||
row.append(i); col.append(int(k)); dat.append(float(v))
|
||||
i += 1
|
||||
csr = scipy.sparse.csr_matrix( (dat, (row,col)) )
|
||||
dtrain = xgb.DMatrix( csr )
|
||||
dtrain.set_label(labels)
|
||||
watchlist = [(dtest,'eval'), (dtrain,'train')]
|
||||
bst = xgb.train( param, dtrain, num_round, watchlist )
|
||||
|
||||
print ('start running example of build DMatrix from numpy array')
|
||||
# NOTE: npymat is numpy array, we will convert it into scipy.sparse.csr_matrix in internal implementation,then convert to DMatrix
|
||||
npymat = csr.todense()
|
||||
dtrain = xgb.DMatrix( npymat)
|
||||
dtrain.set_label(labels)
|
||||
watchlist = [(dtest,'eval'), (dtrain,'train')]
|
||||
bst = xgb.train( param, dtrain, num_round, watchlist )
|
||||
|
||||
|
||||
26 demo/guide-python/boost_from_prediction.py Executable file
@ -0,0 +1,26 @@
#!/usr/bin/python
|
||||
import sys
|
||||
import numpy as np
|
||||
sys.path.append('../../wrapper')
|
||||
import xgboost as xgb
|
||||
|
||||
dtrain = xgb.DMatrix('../data/agaricus.txt.train')
|
||||
dtest = xgb.DMatrix('../data/agaricus.txt.test')
|
||||
watchlist = [(dtest,'eval'), (dtrain,'train')]
|
||||
###
|
||||
# advanced: start from a initial base prediction
|
||||
#
|
||||
print ('start running example to start from a initial prediction')
|
||||
# specify parameters via map, definition are same as c++ version
|
||||
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic' }
|
||||
# train xgboost for 1 round
|
||||
bst = xgb.train( param, dtrain, 1, watchlist )
|
||||
# Note: we need the margin value instead of transformed prediction in set_base_margin
|
||||
# do predict with output_margin=True, will always give you margin values before logistic transformation
|
||||
ptrain = bst.predict(dtrain, output_margin=True)
|
||||
ptest = bst.predict(dtest, output_margin=True)
|
||||
dtrain.set_base_margin(ptrain)
|
||||
dtest.set_base_margin(ptest)
|
||||
|
||||
print ('this is result of running from initial prediction')
|
||||
bst = xgb.train( param, dtrain, 1, watchlist )
|
||||
44 demo/guide-python/custom_objective.py Executable file
@ -0,0 +1,44 @@
#!/usr/bin/python
|
||||
import sys
|
||||
import numpy as np
|
||||
sys.path.append('../../wrapper')
|
||||
import xgboost as xgb
|
||||
###
|
||||
# advanced: cutomsized loss function
|
||||
#
|
||||
print ('start running example to used cutomized objective function')
|
||||
|
||||
dtrain = xgb.DMatrix('../data/agaricus.txt.train')
|
||||
dtest = xgb.DMatrix('../data/agaricus.txt.test')
|
||||
|
||||
# note: for customized objective function, we leave objective as default
|
||||
# note: what we are getting is margin value in prediction
|
||||
# you must know what you are doing
|
||||
param = {'max_depth':2, 'eta':1, 'silent':1 }
|
||||
watchlist = [(dtest,'eval'), (dtrain,'train')]
|
||||
num_round = 2
|
||||
|
||||
# user define objective function, given prediction, return gradient and second order gradient
|
||||
# this is loglikelihood loss
|
||||
def logregobj(preds, dtrain):
|
||||
labels = dtrain.get_label()
|
||||
preds = 1.0 / (1.0 + np.exp(-preds))
|
||||
grad = preds - labels
|
||||
hess = preds * (1.0-preds)
|
||||
return grad, hess
|
||||
|
||||
# user defined evaluation function, return a pair metric_name, result
|
||||
# NOTE: when you do customized loss function, the default prediction value is margin
|
||||
# this may make buildin evalution metric not function properly
|
||||
# for example, we are doing logistic loss, the prediction is score before logistic transformation
|
||||
# the buildin evaluation error assumes input is after logistic transformation
|
||||
# Take this in mind when you use the customization, and maybe you need write customized evaluation function
|
||||
def evalerror(preds, dtrain):
|
||||
labels = dtrain.get_label()
|
||||
# return a pair metric_name, result
|
||||
# since preds are margin(before logistic transformation, cutoff at 0)
|
||||
return 'error', float(sum(labels != (preds > 0.0))) / len(labels)
|
||||
|
||||
# training with customized objective, we can also do step by step training
|
||||
# simply look at xgboost.py's implementation of train
|
||||
bst = xgb.train(param, dtrain, num_round, watchlist, logregobj, evalerror)
|
||||
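A note for readers of custom_objective.py above: the grad and hess returned by logregobj are the first and second derivatives of the logistic negative log-likelihood with respect to the margin score. A short derivation (standard calculus, not part of the commit), writing p for the sigmoid of the margin:

```latex
L(y,\hat y) = -\,y\log p - (1-y)\log(1-p), \qquad p = \sigma(\hat y) = \frac{1}{1+e^{-\hat y}}
\frac{\partial L}{\partial \hat y} = p - y \quad (\texttt{grad}), \qquad
\frac{\partial^2 L}{\partial \hat y^2} = p\,(1-p) \quad (\texttt{hess})
```

This is also why the demo leaves `objective` unset and works on margin values: the wrapper hands the raw scores to logregobj, which applies the sigmoid itself.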
22 demo/guide-python/predict_first_ntree.py Executable file
@ -0,0 +1,22 @@
#!/usr/bin/python
|
||||
import sys
|
||||
import numpy as np
|
||||
sys.path.append('../../wrapper')
|
||||
import xgboost as xgb
|
||||
|
||||
### load data in do training
|
||||
dtrain = xgb.DMatrix('../data/agaricus.txt.train')
|
||||
dtest = xgb.DMatrix('../data/agaricus.txt.test')
|
||||
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic' }
|
||||
watchlist = [(dtest,'eval'), (dtrain,'train')]
|
||||
num_round = 3
|
||||
bst = xgb.train(param, dtrain, num_round, watchlist)
|
||||
|
||||
print ('start testing prediction from first n trees')
|
||||
### predict using first 1 tree
|
||||
label = dtest.get_label()
|
||||
ypred1 = bst.predict(dtest, ntree_limit=1)
|
||||
# by default, we predict using all the trees
|
||||
ypred2 = bst.predict(dtest)
|
||||
print ('error of ypred1=%f' % (np.sum((ypred1>0.5)!=label) /float(len(label))))
|
||||
print ('error of ypred2=%f' % (np.sum((ypred2>0.5)!=label) /float(len(label))))
|
||||
5 demo/guide-python/runall.sh Executable file
@ -0,0 +1,5 @@
#!/bin/bash
|
||||
python basic_walkthrough.py
|
||||
python custom_objective.py
|
||||
python boost_from_prediction.py
|
||||
rm *~ *.model *.buffer
|
||||
@ -24,6 +24,7 @@ class GBLinear : public IGradBooster {
|
||||
}
|
||||
// set model parameters
|
||||
virtual void SetParam(const char *name, const char *val) {
|
||||
using namespace std;
|
||||
if (!strncmp(name, "bst:", 4)) {
|
||||
param.SetParam(name + 4, val);
|
||||
}
|
||||
@ -166,6 +167,7 @@ class GBLinear : public IGradBooster {
|
||||
learning_rate = 1.0f;
|
||||
}
|
||||
inline void SetParam(const char *name, const char *val) {
|
||||
using namespace std;
|
||||
// sync-names
|
||||
if (!strcmp("eta", name)) learning_rate = static_cast<float>(atof(val));
|
||||
if (!strcmp("lambda", name)) reg_lambda = static_cast<float>(atof(val));
|
||||
@ -207,9 +209,10 @@ class GBLinear : public IGradBooster {
|
||||
Param(void) {
|
||||
num_feature = 0;
|
||||
num_output_group = 1;
|
||||
memset(reserved, 0, sizeof(reserved));
|
||||
std::memset(reserved, 0, sizeof(reserved));
|
||||
}
|
||||
inline void SetParam(const char *name, const char *val) {
|
||||
using namespace std;
|
||||
if (!strcmp(name, "bst:num_feature")) num_feature = atoi(val);
|
||||
if (!strcmp(name, "num_output_group")) num_output_group = atoi(val);
|
||||
}
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
#define _CRT_SECURE_NO_WARNINGS
|
||||
#define _CRT_SECURE_NO_DEPRECATE
|
||||
#include <cstring>
|
||||
using namespace std;
|
||||
#include "./gbm.h"
|
||||
#include "./gbtree-inl.hpp"
|
||||
#include "./gblinear-inl.hpp"
|
||||
@ -9,6 +8,7 @@ using namespace std;
|
||||
namespace xgboost {
|
||||
namespace gbm {
|
||||
IGradBooster* CreateGradBooster(const char *name) {
|
||||
using namespace std;
|
||||
if (!strcmp("gbtree", name)) return new GBTree();
|
||||
if (!strcmp("gblinear", name)) return new GBLinear();
|
||||
utils::Error("unknown booster type: %s", name);
|
||||
|
||||
@ -23,6 +23,7 @@ class GBTree : public IGradBooster {
|
||||
this->Clear();
|
||||
}
|
||||
virtual void SetParam(const char *name, const char *val) {
|
||||
using namespace std;
|
||||
if (!strncmp(name, "bst:", 4)) {
|
||||
cfg.push_back(std::make_pair(std::string(name+4), std::string(val)));
|
||||
// set into updaters, if already intialized
|
||||
@ -171,14 +172,14 @@ class GBTree : public IGradBooster {
|
||||
updaters.clear();
|
||||
std::string tval = tparam.updater_seq;
|
||||
char *pstr;
|
||||
pstr = strtok(&tval[0], ",");
|
||||
pstr = std::strtok(&tval[0], ",");
|
||||
while (pstr != NULL) {
|
||||
updaters.push_back(tree::CreateUpdater(pstr));
|
||||
for (size_t j = 0; j < cfg.size(); ++j) {
|
||||
// set parameters
|
||||
updaters.back()->SetParam(cfg[j].first.c_str(), cfg[j].second.c_str());
|
||||
}
|
||||
pstr = strtok(NULL, ",");
|
||||
pstr = std::strtok(NULL, ",");
|
||||
}
|
||||
tparam.updater_initialized = 1;
|
||||
}
|
||||
@ -279,6 +280,7 @@ class GBTree : public IGradBooster {
|
||||
updater_initialized = 0;
|
||||
}
|
||||
inline void SetParam(const char *name, const char *val){
|
||||
using namespace std;
|
||||
if (!strcmp(name, "updater") &&
|
||||
strcmp(updater_seq.c_str(), val) != 0) {
|
||||
updater_seq = val;
|
||||
@ -319,7 +321,7 @@ class GBTree : public IGradBooster {
|
||||
num_pbuffer = 0;
|
||||
num_output_group = 1;
|
||||
size_leaf_vector = 0;
|
||||
memset(reserved, 0, sizeof(reserved));
|
||||
std::memset(reserved, 0, sizeof(reserved));
|
||||
}
|
||||
/*!
|
||||
* \brief set parameters from outside
|
||||
@ -327,6 +329,7 @@ class GBTree : public IGradBooster {
|
||||
* \param val value of the parameter
|
||||
*/
|
||||
inline void SetParam(const char *name, const char *val) {
|
||||
using namespace std;
|
||||
if (!strcmp("num_pbuffer", name)) num_pbuffer = atol(val);
|
||||
if (!strcmp("num_output_group", name)) num_output_group = atol(val);
|
||||
if (!strcmp("bst:num_roots", name)) num_roots = atoi(val);
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
#define _CRT_SECURE_NO_WARNINGS
|
||||
#define _CRT_SECURE_NO_DEPRECATE
|
||||
#include <string>
|
||||
using namespace std;
|
||||
#include "./io.h"
|
||||
#include "../utils/io.h"
|
||||
#include "../utils/utils.h"
|
||||
|
||||
@ -55,8 +55,8 @@ class DMatrixSimple : public DataMatrix {
|
||||
RowBatch::Inst inst = batch[i];
|
||||
row_data_.resize(row_data_.size() + inst.length);
|
||||
if (inst.length != 0) {
|
||||
memcpy(&row_data_[row_ptr_.back()], inst.data,
|
||||
sizeof(RowBatch::Entry) * inst.length);
|
||||
std::memcpy(&row_data_[row_ptr_.back()], inst.data,
|
||||
sizeof(RowBatch::Entry) * inst.length);
|
||||
}
|
||||
row_ptr_.push_back(row_ptr_.back() + inst.length);
|
||||
}
|
||||
@ -82,6 +82,7 @@ class DMatrixSimple : public DataMatrix {
|
||||
* \param silent whether print information or not
|
||||
*/
|
||||
inline void LoadText(const char* fname, bool silent = false) {
|
||||
using namespace std;
|
||||
this->Clear();
|
||||
FILE* file = utils::FopenCheck(fname, "r");
|
||||
float label; bool init = true;
|
||||
@ -135,7 +136,7 @@ class DMatrixSimple : public DataMatrix {
|
||||
* \return whether loading is success
|
||||
*/
|
||||
inline bool LoadBinary(const char* fname, bool silent = false) {
|
||||
FILE *fp = fopen64(fname, "rb");
|
||||
std::FILE *fp = fopen64(fname, "rb");
|
||||
if (fp == NULL) return false;
|
||||
utils::FileStream fs(fp);
|
||||
this->LoadBinary(fs, silent, fname);
|
||||
@ -208,6 +209,7 @@ class DMatrixSimple : public DataMatrix {
|
||||
* \param savebuffer whether do save binary buffer if it is text
|
||||
*/
|
||||
inline void CacheLoad(const char *fname, bool silent = false, bool savebuffer = true) {
|
||||
using namespace std;
|
||||
size_t len = strlen(fname);
|
||||
if (len > 8 && !strcmp(fname + len - 7, ".buffer")) {
|
||||
if (!this->LoadBinary(fname, silent)) {
|
||||
@ -216,7 +218,7 @@ class DMatrixSimple : public DataMatrix {
|
||||
return;
|
||||
}
|
||||
char bname[1024];
|
||||
snprintf(bname, sizeof(bname), "%s.buffer", fname);
|
||||
utils::SPrintf(bname, sizeof(bname), "%s.buffer", fname);
|
||||
if (!this->LoadBinary(bname, silent)) {
|
||||
this->LoadText(fname, silent);
|
||||
if (savebuffer) this->SaveBinary(bname, silent);
|
||||
|
||||
@ -90,6 +90,7 @@ struct MetaInfo {
|
||||
}
|
||||
// try to load group information from file, if exists
|
||||
inline bool TryLoadGroup(const char* fname, bool silent = false) {
|
||||
using namespace std;
|
||||
FILE *fi = fopen64(fname, "r");
|
||||
if (fi == NULL) return false;
|
||||
group_ptr.push_back(0);
|
||||
@ -105,6 +106,7 @@ struct MetaInfo {
|
||||
return true;
|
||||
}
|
||||
inline std::vector<float>& GetFloatInfo(const char *field) {
|
||||
using namespace std;
|
||||
if (!strcmp(field, "label")) return labels;
|
||||
if (!strcmp(field, "weight")) return weights;
|
||||
if (!strcmp(field, "base_margin")) return base_margin;
|
||||
@ -115,6 +117,7 @@ struct MetaInfo {
|
||||
return ((MetaInfo*)this)->GetFloatInfo(field);
|
||||
}
|
||||
inline std::vector<unsigned> &GetUIntInfo(const char *field) {
|
||||
using namespace std;
|
||||
if (!strcmp(field, "root_index")) return info.root_index;
|
||||
if (!strcmp(field, "fold_index")) return info.fold_index;
|
||||
utils::Error("unknown field %s", field);
|
||||
@ -125,6 +128,7 @@ struct MetaInfo {
|
||||
}
|
||||
// try to load weight information from file, if exists
|
||||
inline bool TryLoadFloatInfo(const char *field, const char* fname, bool silent = false) {
|
||||
using namespace std;
|
||||
std::vector<float> &data = this->GetFloatInfo(field);
|
||||
FILE *fi = fopen64(fname, "r");
|
||||
if (fi == NULL) return false;
|
||||
|
||||
@ -147,10 +147,11 @@ struct EvalAMS : public IEvaluator {
|
||||
explicit EvalAMS(const char *name) {
|
||||
name_ = name;
|
||||
// note: ams@0 will automatically select which ratio to go
|
||||
utils::Check(sscanf(name, "ams@%f", &ratio_) == 1, "invalid ams format");
|
||||
utils::Check(std::sscanf(name, "ams@%f", &ratio_) == 1, "invalid ams format");
|
||||
}
|
||||
virtual float Eval(const std::vector<float> &preds,
|
||||
const MetaInfo &info) const {
|
||||
using namespace std;
|
||||
const bst_omp_uint ndata = static_cast<bst_omp_uint>(info.labels.size());
|
||||
|
||||
utils::Check(info.weights.size() == ndata, "we need weight to evaluate ams");
|
||||
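Background on the ams@ metric parsed above (from the Kaggle Higgs Boson challenge, not stated in the diff): with s and b the weighted sums of selected signal (true positive) and background (false positive) events, the approximate median significance is commonly defined as

```latex
\mathrm{AMS} = \sqrt{\,2\left((s + b + b_{reg})\,\ln\!\left(1 + \frac{s}{b + b_{reg}}\right) - s\right)}\,, \qquad b_{reg} = 10
```

The ratio_ read from ams@x controls how large a top-ranked fraction is treated as selected; per the code comment, ams@0 lets the evaluator choose the ratio automatically.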
@ -202,6 +203,7 @@ struct EvalAMS : public IEvaluator {
|
||||
struct EvalPrecisionRatio : public IEvaluator{
|
||||
public:
|
||||
explicit EvalPrecisionRatio(const char *name) : name_(name) {
|
||||
using namespace std;
|
||||
if (sscanf(name, "apratio@%f", &ratio_) == 1) {
|
||||
use_ap = 1;
|
||||
} else {
|
||||
@ -342,6 +344,7 @@ struct EvalRankList : public IEvaluator {
|
||||
|
||||
protected:
|
||||
explicit EvalRankList(const char *name) {
|
||||
using namespace std;
|
||||
name_ = name;
|
||||
minus_ = false;
|
||||
if (sscanf(name, "%*[^@]@%u[-]?", &topn_) != 1) {
|
||||
@ -388,7 +391,7 @@ struct EvalNDCG : public EvalRankList{
|
||||
for (size_t i = 0; i < rec.size() && i < this->topn_; ++i) {
|
||||
const unsigned rel = rec[i].second;
|
||||
if (rel != 0) {
|
||||
sumdcg += ((1 << rel) - 1) / log(i + 2.0);
|
||||
sumdcg += ((1 << rel) - 1) / std::log(i + 2.0);
|
||||
}
|
||||
}
|
||||
return static_cast<float>(sumdcg);
|
||||
|
||||
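For the EvalNDCG hunk just above: the only change is qualifying log as std::log, but for reference the loop accumulates discounted cumulative gain terms. With rec ordered by predicted score, rel_i the relevance label of the i-th item (i starting at 0), and topn_ the cutoff, the sum matches

```latex
\mathrm{DCG@}k = \sum_{i=0}^{k-1} \frac{2^{\mathrm{rel}_i} - 1}{\ln(i + 2)}
```

NDCG then normalizes this by the same sum computed over the ideal, label-sorted ordering; that normalization lives outside the lines shown here.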
@ -36,6 +36,7 @@ struct IEvaluator{
|
||||
namespace xgboost {
|
||||
namespace learner {
|
||||
inline IEvaluator* CreateEvaluator(const char *name) {
|
||||
using namespace std;
|
||||
if (!strcmp(name, "rmse")) return new EvalRMSE();
|
||||
if (!strcmp(name, "error")) return new EvalError();
|
||||
if (!strcmp(name, "merror")) return new EvalMatchError();
|
||||
@ -56,6 +57,7 @@ inline IEvaluator* CreateEvaluator(const char *name) {
|
||||
class EvalSet{
|
||||
public:
|
||||
inline void AddEval(const char *name) {
|
||||
using namespace std;
|
||||
for (size_t i = 0; i < evals_.size(); ++i) {
|
||||
if (!strcmp(name, evals_[i]->Name())) return;
|
||||
}
|
||||
|
||||
@ -79,6 +79,7 @@ class BoostLearner {
|
||||
* \param val value of the parameter
|
||||
*/
|
||||
inline void SetParam(const char *name, const char *val) {
|
||||
using namespace std;
|
||||
// in this version, bst: prefix is no longer required
|
||||
if (strncmp(name, "bst:", 4) != 0) {
|
||||
std::string n = "bst:"; n += name;
|
||||
@ -290,7 +291,7 @@ class BoostLearner {
|
||||
base_score = 0.5f;
|
||||
num_feature = 0;
|
||||
num_class = 0;
|
||||
memset(reserved, 0, sizeof(reserved));
|
||||
std::memset(reserved, 0, sizeof(reserved));
|
||||
}
|
||||
/*!
|
||||
* \brief set parameters from outside
|
||||
@ -298,6 +299,7 @@ class BoostLearner {
|
||||
* \param val value of the parameter
|
||||
*/
|
||||
inline void SetParam(const char *name, const char *val) {
|
||||
using namespace std;
|
||||
if (!strcmp("base_score", name)) base_score = static_cast<float>(atof(val));
|
||||
if (!strcmp("num_class", name)) num_class = atoi(val);
|
||||
if (!strcmp("bst:num_feature", name)) num_feature = atoi(val);
|
||||
|
||||
@ -101,6 +101,7 @@ class RegLossObj : public IObjFunction{
|
||||
}
|
||||
virtual ~RegLossObj(void) {}
|
||||
virtual void SetParam(const char *name, const char *val) {
|
||||
using namespace std;
|
||||
if (!strcmp("scale_pos_weight", name)) {
|
||||
scale_pos_weight = static_cast<float>(atof(val));
|
||||
}
|
||||
@ -156,6 +157,7 @@ class SoftmaxMultiClassObj : public IObjFunction {
|
||||
}
|
||||
virtual ~SoftmaxMultiClassObj(void) {}
|
||||
virtual void SetParam(const char *name, const char *val) {
|
||||
using namespace std;
|
||||
if (!strcmp( "num_class", name )) nclass = atoi(val);
|
||||
}
|
||||
virtual void GetGradient(const std::vector<float> &preds,
|
||||
@ -247,6 +249,7 @@ class LambdaRankObj : public IObjFunction {
|
||||
}
|
||||
virtual ~LambdaRankObj(void) {}
|
||||
virtual void SetParam(const char *name, const char *val) {
|
||||
using namespace std;
|
||||
if (!strcmp( "loss_type", name )) loss.loss_type = atoi(val);
|
||||
if (!strcmp( "fix_list_weight", name)) fix_list_weight = static_cast<float>(atof(val));
|
||||
if (!strcmp( "num_pairsample", name)) num_pairsample = atoi(val);
|
||||
|
||||
@ -67,6 +67,7 @@ namespace xgboost {
|
||||
namespace learner {
|
||||
/*! \brief factory funciton to create objective function by name */
|
||||
inline IObjFunction* CreateObjFunction(const char *name) {
|
||||
using namespace std;
|
||||
if (!strcmp("reg:linear", name)) return new RegLossObj(LossType::kLinearSquare);
|
||||
if (!strcmp("reg:logistic", name)) return new RegLossObj(LossType::kLogisticNeglik);
|
||||
if (!strcmp("binary:logistic", name)) return new RegLossObj(LossType::kLogisticClassify);
|
||||
|
||||
@ -53,7 +53,7 @@ class TreeModel {
|
||||
Param(void) {
|
||||
max_depth = 0;
|
||||
size_leaf_vector = 0;
|
||||
memset(reserved, 0, sizeof(reserved));
|
||||
std::memset(reserved, 0, sizeof(reserved));
|
||||
}
|
||||
/*!
|
||||
* \brief set parameters from outside
|
||||
@ -61,6 +61,7 @@ class TreeModel {
|
||||
* \param val value of the parameter
|
||||
*/
|
||||
inline void SetParam(const char *name, const char *val) {
|
||||
using namespace std;
|
||||
if (!strcmp("num_roots", name)) num_roots = atoi(val);
|
||||
if (!strcmp("num_feature", name)) num_feature = atoi(val);
|
||||
if (!strcmp("size_leaf_vector", name)) size_leaf_vector = atoi(val);
|
||||
|
||||
@ -65,6 +65,7 @@ struct TrainParam{
|
||||
* \param val value of the parameter
|
||||
*/
|
||||
inline void SetParam(const char *name, const char *val) {
|
||||
using namespace std;
|
||||
// sync-names
|
||||
if (!strcmp(name, "gamma")) min_split_loss = static_cast<float>(atof(val));
|
||||
if (!strcmp(name, "eta")) learning_rate = static_cast<float>(atof(val));
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
#define _CRT_SECURE_NO_WARNINGS
|
||||
#define _CRT_SECURE_NO_DEPRECATE
|
||||
#include <cstring>
|
||||
using namespace std;
|
||||
#include "./updater.h"
|
||||
#include "./updater_prune-inl.hpp"
|
||||
#include "./updater_refresh-inl.hpp"
|
||||
@ -10,6 +9,7 @@ using namespace std;
|
||||
namespace xgboost {
|
||||
namespace tree {
|
||||
IUpdater* CreateUpdater(const char *name) {
|
||||
using namespace std;
|
||||
if (!strcmp(name, "prune")) return new TreePruner();
|
||||
if (!strcmp(name, "refresh")) return new TreeRefresher<GradStats>();
|
||||
if (!strcmp(name, "grow_colmaker")) return new ColMaker<GradStats>();
|
||||
|
||||
@ -85,18 +85,18 @@ class ColMaker: public IUpdater {
|
||||
const BoosterInfo &info,
|
||||
RegTree *p_tree) {
|
||||
this->InitData(gpair, *p_fmat, info.root_index, *p_tree);
|
||||
this->InitNewNode(qexpand, gpair, *p_fmat, info, *p_tree);
|
||||
this->InitNewNode(qexpand_, gpair, *p_fmat, info, *p_tree);
|
||||
for (int depth = 0; depth < param.max_depth; ++depth) {
|
||||
this->FindSplit(depth, this->qexpand, gpair, p_fmat, info, p_tree);
|
||||
this->ResetPosition(this->qexpand, p_fmat, *p_tree);
|
||||
this->UpdateQueueExpand(*p_tree, &this->qexpand);
|
||||
this->InitNewNode(qexpand, gpair, *p_fmat, info, *p_tree);
|
||||
this->FindSplit(depth, qexpand_, gpair, p_fmat, info, p_tree);
|
||||
this->ResetPosition(qexpand_, p_fmat, *p_tree);
|
||||
this->UpdateQueueExpand(*p_tree, &qexpand_);
|
||||
this->InitNewNode(qexpand_, gpair, *p_fmat, info, *p_tree);
|
||||
// if nothing left to be expand, break
|
||||
if (qexpand.size() == 0) break;
|
||||
if (qexpand_.size() == 0) break;
|
||||
}
|
||||
// set all the rest expanding nodes to leaf
|
||||
for (size_t i = 0; i < qexpand.size(); ++i) {
|
||||
const int nid = qexpand[i];
|
||||
for (size_t i = 0; i < qexpand_.size(); ++i) {
|
||||
const int nid = qexpand_[i];
|
||||
(*p_tree)[nid].set_leaf(snode[nid].weight * param.learning_rate);
|
||||
}
|
||||
// remember auxiliary statistics in the tree node
|
||||
@ -169,9 +169,9 @@ class ColMaker: public IUpdater {
|
||||
snode.reserve(256);
|
||||
}
|
||||
{// expand query
|
||||
qexpand.reserve(256); qexpand.clear();
|
||||
qexpand_.reserve(256); qexpand_.clear();
|
||||
for (int i = 0; i < tree.param.num_roots; ++i) {
|
||||
qexpand.push_back(i);
|
||||
qexpand_.push_back(i);
|
||||
}
|
||||
}
|
||||
}
|
||||
@ -233,6 +233,7 @@ class ColMaker: public IUpdater {
|
||||
const BoosterInfo &info) {
|
||||
bool need_forward = param.need_forward_search(fmat.GetColDensity(fid));
|
||||
bool need_backward = param.need_backward_search(fmat.GetColDensity(fid));
|
||||
const std::vector<int> &qexpand = qexpand_;
|
||||
int nthread;
|
||||
#pragma omp parallel
|
||||
{
|
||||
@ -362,6 +363,7 @@ class ColMaker: public IUpdater {
|
||||
const std::vector<bst_gpair> &gpair,
|
||||
const BoosterInfo &info,
|
||||
std::vector<ThreadEntry> &temp) {
|
||||
const std::vector<int> &qexpand = qexpand_;
|
||||
// clear all the temp statistics
|
||||
for (size_t j = 0; j < qexpand.size(); ++j) {
|
||||
temp[qexpand[j]].stats.Clear();
|
||||
@ -382,7 +384,7 @@ class ColMaker: public IUpdater {
|
||||
e.last_fvalue = fvalue;
|
||||
} else {
|
||||
// try to find a split
|
||||
if (fabsf(fvalue - e.last_fvalue) > rt_2eps && e.stats.sum_hess >= param.min_child_weight) {
|
||||
if (std::abs(fvalue - e.last_fvalue) > rt_2eps && e.stats.sum_hess >= param.min_child_weight) {
|
||||
c.SetSubstract(snode[nid].stats, e.stats);
|
||||
if (c.sum_hess >= param.min_child_weight) {
|
||||
bst_float loss_chg = static_cast<bst_float>(e.stats.CalcGain(param) + c.CalcGain(param) - snode[nid].root_gain);
|
||||
@ -539,7 +541,7 @@ class ColMaker: public IUpdater {
|
||||
/*! \brief TreeNode Data: statistics for each constructed node */
|
||||
std::vector<NodeEntry> snode;
|
||||
/*! \brief queue of nodes to be expanded */
|
||||
std::vector<int> qexpand;
|
||||
std::vector<int> qexpand_;
|
||||
};
|
||||
};
|
||||
|
||||
|
||||
@ -17,6 +17,7 @@ class TreePruner: public IUpdater {
|
||||
virtual ~TreePruner(void) {}
|
||||
// set training parameter
|
||||
virtual void SetParam(const char *name, const char *val) {
|
||||
using namespace std;
|
||||
param.SetParam(name, val);
|
||||
if (!strcmp(name, "silent")) silent = atoi(val);
|
||||
}
|
||||
|
||||
@ -24,15 +24,15 @@ class FeatMap {
|
||||
// function definitions
|
||||
/*! \brief load feature map from text format */
|
||||
inline void LoadText(const char *fname) {
|
||||
FILE *fi = utils::FopenCheck(fname, "r");
|
||||
std::FILE *fi = utils::FopenCheck(fname, "r");
|
||||
this->LoadText(fi);
|
||||
fclose(fi);
|
||||
std::fclose(fi);
|
||||
}
|
||||
/*! \brief load feature map from text format */
|
||||
inline void LoadText(FILE *fi) {
|
||||
inline void LoadText(std::FILE *fi) {
|
||||
int fid;
|
||||
char fname[1256], ftype[1256];
|
||||
while (fscanf(fi, "%d\t%[^\t]\t%s\n", &fid, fname, ftype) == 3) {
|
||||
while (std::fscanf(fi, "%d\t%[^\t]\t%s\n", &fid, fname, ftype) == 3) {
|
||||
this->PushBack(fid, fname, ftype);
|
||||
}
|
||||
}
|
||||
@ -62,6 +62,7 @@ class FeatMap {
|
||||
|
||||
private:
|
||||
inline static Type GetType(const char *tname) {
|
||||
using namespace std;
|
||||
if (!strcmp("i", tname)) return kIndicator;
|
||||
if (!strcmp("q", tname)) return kQuantitive;
|
||||
if (!strcmp("int", tname)) return kInteger;
|
||||
|
||||
@ -105,20 +105,20 @@ class FileStream : public ISeekStream {
|
||||
this->fp = NULL;
|
||||
}
|
||||
virtual size_t Read(void *ptr, size_t size) {
|
||||
return fread(ptr, size, 1, fp);
|
||||
return std::fread(ptr, size, 1, fp);
|
||||
}
|
||||
virtual void Write(const void *ptr, size_t size) {
|
||||
fwrite(ptr, size, 1, fp);
|
||||
std::fwrite(ptr, size, 1, fp);
|
||||
}
|
||||
virtual void Seek(long pos) {
|
||||
fseek(fp, pos, SEEK_SET);
|
||||
std::fseek(fp, pos, SEEK_SET);
|
||||
}
|
||||
virtual long Tell(void) {
|
||||
return ftell(fp);
|
||||
return std::ftell(fp);
|
||||
}
|
||||
inline void Close(void) {
|
||||
if (fp != NULL){
|
||||
fclose(fp); fp = NULL;
|
||||
std::fclose(fp); fp = NULL;
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@ -53,7 +53,7 @@ inline double NextDouble(void) {
|
||||
}
|
||||
/*! \brief return a random number in n */
|
||||
inline uint32_t NextUInt32(uint32_t n) {
|
||||
return (uint32_t)floor(NextDouble() * n);
|
||||
return (uint32_t)std::floor(NextDouble() * n);
|
||||
}
|
||||
/*! \brief return x~N(mu,sigma^2) */
|
||||
inline double SampleNormal(double mu, double sigma) {
|
||||
|
||||
@ -86,7 +86,7 @@ void HandlePrint(const char *msg);
|
||||
#endif
|
||||
#endif
|
||||
#ifdef XGBOOST_STRICT_CXX98_
|
||||
// these function pointers are to be assigned
|
||||
// these function pointers are to be assigned
|
||||
extern "C" void (*Printf)(const char *fmt, ...);
|
||||
extern "C" int (*SPrintf)(char *buf, size_t size, const char *fmt, ...);
|
||||
extern "C" void (*Assert)(int exp, const char *fmt, ...);
|
||||
@ -94,7 +94,7 @@ extern "C" void (*Check)(int exp, const char *fmt, ...);
|
||||
extern "C" void (*Error)(const char *fmt, ...);
|
||||
#else
|
||||
/*! \brief printf, print message to the console */
|
||||
inline void Printf(const char *fmt, ...) {
|
||||
inline void Printf(const char *fmt, ...) {
|
||||
std::string msg(kPrintBuffer, '\0');
|
||||
va_list args;
|
||||
va_start(args, fmt);
|
||||
@ -103,7 +103,7 @@ inline void Printf(const char *fmt, ...) {
|
||||
HandlePrint(msg.c_str());
|
||||
}
|
||||
/*! \brief portable version of snprintf */
|
||||
inline int SPrintf(char *buf, size_t size, const char *fmt, ...) {
|
||||
inline int SPrintf(char *buf, size_t size, const char *fmt, ...) {
|
||||
va_list args;
|
||||
va_start(args, fmt);
|
||||
int ret = vsnprintf(buf, size, fmt, args);
|
||||
@ -149,12 +149,12 @@ inline void Error(const char *fmt, ...) {
|
||||
#endif
|
||||
|
||||
/*! \brief replace fopen, report error when the file open fails */
|
||||
inline FILE *FopenCheck(const char *fname, const char *flag) {
|
||||
FILE *fp = fopen64(fname, flag);
|
||||
inline std::FILE *FopenCheck(const char *fname, const char *flag) {
|
||||
std::FILE *fp = fopen64(fname, flag);
|
||||
Check(fp != NULL, "can not open file \"%s\"\n", fname);
|
||||
return fp;
|
||||
}
|
||||
} // namespace utils
|
||||
} // namespace utils
|
||||
// easy utils that can be directly acessed in xgboost
|
||||
/*! \brief get the beginning address of a vector */
|
||||
template<typename T>
|
||||
|
||||
@ -2,11 +2,10 @@ Wrapper of XGBoost
=====
This folder provides wrappers of xgboost for other languages


Python
=====
* To make the python module, type ```make``` in the root directory of the project
* Refer to the walk through example in [python-example/demo.py](python-example/demo.py)
* Refer also to the walk through example in [demo folder](../demo/guide-python)

R
=====

@ -1,3 +0,0 @@
example to use python xgboost, the data is generated from demo/binary_classification, in libsvm format

for usage: see demo.py and comments in demo.py
@ -1,121 +0,0 @@
|
||||
#!/usr/bin/python
|
||||
import sys
|
||||
import numpy as np
|
||||
import scipy.sparse
|
||||
# append the path to xgboost, you may need to change the following line
|
||||
# alternatively, you can add the path to PYTHONPATH environment variable
|
||||
sys.path.append('../')
|
||||
import xgboost as xgb
|
||||
|
||||
### simple example
|
||||
# load file from text file, also binary buffer generated by xgboost
|
||||
dtrain = xgb.DMatrix('agaricus.txt.train')
|
||||
dtest = xgb.DMatrix('agaricus.txt.test')
|
||||
|
||||
# specify parameters via map, definition are same as c++ version
|
||||
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic' }
|
||||
|
||||
# specify validations set to watch performance
|
||||
evallist = [(dtest,'eval'), (dtrain,'train')]
|
||||
num_round = 2
|
||||
bst = xgb.train(param, dtrain, num_round, evallist)
|
||||
|
||||
# this is prediction
|
||||
preds = bst.predict(dtest)
|
||||
labels = dtest.get_label()
|
||||
print ('error=%f' % ( sum(1 for i in range(len(preds)) if int(preds[i]>0.5)!=labels[i]) /float(len(preds))))
|
||||
bst.save_model('0001.model')
|
||||
# dump model
|
||||
bst.dump_model('dump.raw.txt')
|
||||
# dump model with feature map
|
||||
bst.dump_model('dump.nice.txt','featmap.txt')
|
||||
|
||||
# save dmatrix into binary buffer
|
||||
dtest.save_binary('dtest.buffer')
|
||||
bst.save_model('xgb.model')
|
||||
# load model and data in
|
||||
bst2 = xgb.Booster(model_file='xgb.model')
|
||||
dtest2 = xgb.DMatrix('dtest.buffer')
|
||||
preds2 = bst2.predict(dtest2)
|
||||
# assert they are the same
|
||||
assert np.sum(np.abs(preds2-preds)) == 0
|
||||
|
||||
###
|
||||
# build dmatrix from scipy.sparse
|
||||
print ('start running example of build DMatrix from scipy.sparse')
|
||||
labels = []
|
||||
row = []; col = []; dat = []
|
||||
i = 0
|
||||
for l in open('agaricus.txt.train'):
|
||||
arr = l.split()
|
||||
labels.append( int(arr[0]))
|
||||
for it in arr[1:]:
|
||||
k,v = it.split(':')
|
||||
row.append(i); col.append(int(k)); dat.append(float(v))
|
||||
i += 1
|
||||
csr = scipy.sparse.csr_matrix( (dat, (row,col)) )
|
||||
dtrain = xgb.DMatrix( csr )
|
||||
dtrain.set_label(labels)
|
||||
evallist = [(dtest,'eval'), (dtrain,'train')]
|
||||
bst = xgb.train( param, dtrain, num_round, evallist )
|
||||
|
||||
print ('start running example of build DMatrix from numpy array')
|
||||
# NOTE: npymat is numpy array, we will convert it into scipy.sparse.csr_matrix in internal implementation,then convert to DMatrix
|
||||
npymat = csr.todense()
|
||||
dtrain = xgb.DMatrix( npymat)
|
||||
dtrain.set_label(labels)
|
||||
evallist = [(dtest,'eval'), (dtrain,'train')]
|
||||
bst = xgb.train( param, dtrain, num_round, evallist )
|
||||
|
||||
###
|
||||
# advanced: cutomsized loss function
|
||||
#
|
||||
print ('start running example to used cutomized objective function')
|
||||
|
||||
# note: for customized objective function, we leave objective as default
|
||||
# note: what we are getting is margin value in prediction
|
||||
# you must know what you are doing
|
||||
param = {'max_depth':2, 'eta':1, 'silent':1 }
|
||||
|
||||
# user define objective function, given prediction, return gradient and second order gradient
|
||||
# this is loglikelihood loss
|
||||
def logregobj(preds, dtrain):
|
||||
labels = dtrain.get_label()
|
||||
preds = 1.0 / (1.0 + np.exp(-preds))
|
||||
grad = preds - labels
|
||||
hess = preds * (1.0-preds)
|
||||
return grad, hess
|
||||
|
||||
# user defined evaluation function, return a pair metric_name, result
|
||||
# NOTE: when you do customized loss function, the default prediction value is margin
|
||||
# this may make buildin evalution metric not function properly
|
||||
# for example, we are doing logistic loss, the prediction is score before logistic transformation
|
||||
# the buildin evaluation error assumes input is after logistic transformation
|
||||
# Take this in mind when you use the customization, and maybe you need write customized evaluation function
|
||||
def evalerror(preds, dtrain):
|
||||
labels = dtrain.get_label()
|
||||
# return a pair metric_name, result
|
||||
# since preds are margin(before logistic transformation, cutoff at 0)
|
||||
return 'error', float(sum(labels != (preds > 0.0))) / len(labels)
|
||||
|
||||
# training with customized objective, we can also do step by step training
|
||||
# simply look at xgboost.py's implementation of train
|
||||
bst = xgb.train(param, dtrain, num_round, evallist, logregobj, evalerror)
|
||||
|
||||
###
|
||||
# advanced: start from a initial base prediction
|
||||
#
|
||||
print ('start running example to start from a initial prediction')
|
||||
# specify parameters via map, definition are same as c++ version
|
||||
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic' }
|
||||
# train xgboost for 1 round
|
||||
bst = xgb.train( param, dtrain, 1, evallist )
|
||||
# Note: we need the margin value instead of transformed prediction in set_base_margin
|
||||
# do predict with output_margin=True, will always give you margin values before logistic transformation
|
||||
ptrain = bst.predict(dtrain, output_margin=True)
|
||||
ptest = bst.predict(dtest, output_margin=True)
|
||||
dtrain.set_base_margin(ptrain)
|
||||
dtest.set_base_margin(ptest)
|
||||
|
||||
print ('this is result of running from initial prediction')
|
||||
bst = xgb.train( param, dtrain, 1, evallist )
|
||||
@ -53,7 +53,7 @@ class DMatrix:
|
||||
missing: float
|
||||
value in data which need to be present as missing value
|
||||
weight: list or numpy 1d array, optional
|
||||
weight for each instances
|
||||
weight for each instances
|
||||
"""
|
||||
# force into void_p, mac need to pass things in as void_p
|
||||
if data is None:
|
||||
@ -318,7 +318,7 @@ class Booster:
|
||||
self.handle, ctypes.c_char_p(k.encode('utf-8')),
|
||||
ctypes.c_char_p(str(v).encode('utf-8')))
|
||||
|
||||
def update(self, dtrain, it):
|
||||
def update(self, dtrain, it, fobj=None):
|
||||
"""
|
||||
update
|
||||
Args:
|
||||
@ -326,11 +326,19 @@ class Booster:
|
||||
the training DMatrix
|
||||
it: int
|
||||
current iteration number
|
||||
fobj: function
|
||||
cutomzied objective function
|
||||
Returns:
|
||||
None
|
||||
"""
|
||||
assert isinstance(dtrain, DMatrix)
|
||||
xglib.XGBoosterUpdateOneIter(self.handle, it, dtrain.handle)
|
||||
if fobj is None:
|
||||
xglib.XGBoosterUpdateOneIter(self.handle, it, dtrain.handle)
|
||||
else:
|
||||
pred = self.predict( dtrain )
|
||||
grad, hess = fobj( pred, dtrain )
|
||||
self.boost( dtrain, grad, hess )
|
||||
|
||||
def boost(self, dtrain, grad, hess):
|
||||
""" update
|
||||
Args:
|
||||
@ -347,22 +355,32 @@ class Booster:
|
||||
(ctypes.c_float*len(grad))(*grad),
|
||||
(ctypes.c_float*len(hess))(*hess),
|
||||
len(grad))
|
||||
def eval_set(self, evals, it = 0):
|
||||
|
||||
def eval_set(self, evals, it = 0, feval = None):
|
||||
"""evaluates by metric
|
||||
Args:
|
||||
evals: list of tuple (DMatrix, string)
|
||||
lists of items to be evaluated
|
||||
it: int
|
||||
feval: function
|
||||
custom evaluation function
|
||||
Returns:
|
||||
evals result
|
||||
"""
|
||||
for d in evals:
|
||||
assert isinstance(d[0], DMatrix)
|
||||
assert isinstance(d[1], str)
|
||||
dmats = (ctypes.c_void_p * len(evals) )(*[ d[0].handle for d in evals])
|
||||
evnames = (ctypes.c_char_p * len(evals))(
|
||||
* [ctypes.c_char_p(d[1].encode('utf-8')) for d in evals])
|
||||
return xglib.XGBoosterEvalOneIter(self.handle, it, dmats, evnames, len(evals))
|
||||
if feval is None:
|
||||
for d in evals:
|
||||
assert isinstance(d[0], DMatrix)
|
||||
assert isinstance(d[1], str)
|
||||
dmats = (ctypes.c_void_p * len(evals) )(*[ d[0].handle for d in evals])
|
||||
evnames = (ctypes.c_char_p * len(evals))(
|
||||
* [ctypes.c_char_p(d[1].encode('utf-8')) for d in evals])
|
||||
return xglib.XGBoosterEvalOneIter(self.handle, it, dmats, evnames, len(evals))
|
||||
else:
|
||||
res = '[%d]' % it
|
||||
for dm, evname in evals:
|
||||
name, val = feval(self.predict(dm), dm)
|
||||
res += '\t%s-%s:%f' % (evname, name, val)
|
||||
return res
|
||||
def eval(self, mat, name = 'eval', it = 0):
|
||||
return self.eval_set( [(mat,name)], it)
|
||||
def predict(self, data, output_margin=False, ntree_limit=0):
|
||||
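Taken together, the hunks above let Booster.update accept a custom objective (fobj) and Booster.eval_set accept a custom metric (feval). A minimal usage sketch of the new keyword arguments, not part of the commit, reusing the logregobj/evalerror definitions and data paths from demo/guide-python/custom_objective.py:

```python
import sys
import numpy as np
sys.path.append('../../wrapper')   # same path assumption as the demos above
import xgboost as xgb

dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
param = {'max_depth': 2, 'eta': 1, 'silent': 1}  # objective left unset: fobj works on margins

def logregobj(preds, dtrain):
    # custom objective: gradient and hessian of logistic loss on the margin
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))
    return preds - labels, preds * (1.0 - preds)

def evalerror(preds, dtrain):
    # custom metric: classification error, cutoff at margin 0
    labels = dtrain.get_label()
    return 'error', float(sum(labels != (preds > 0.0))) / len(labels)

bst = xgb.Booster(param, [dtrain, dtest])
for i in range(2):
    bst.update(dtrain, i, logregobj)                 # boost one round with the custom objective
    print(bst.eval_set(watchlist, i, evalerror))     # feval path returns a formatted string
```

This is essentially the loop that the reworked train() below runs internally when obj and feval are passed.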
@ -373,7 +391,6 @@ class Booster:
|
||||
the dmatrix storing the input
|
||||
output_margin: bool
|
||||
whether output raw margin value that is untransformed
|
||||
|
||||
ntree_limit: limit number of trees in prediction, default to 0, 0 means using all the trees
|
||||
Returns:
|
||||
numpy array of prediction
|
||||
@ -447,30 +464,6 @@ class Booster:
|
||||
fmap[fid]+= 1
|
||||
return fmap
|
||||
|
||||
def evaluate(bst, evals, it, feval = None):
|
||||
"""evaluation on eval set
|
||||
Args:
|
||||
bst: XGBoost object
|
||||
object of XGBoost model
|
||||
evals: list of tuple (DMatrix, string)
|
||||
obj need to be evaluated
|
||||
it: int
|
||||
feval: optional
|
||||
Returns:
|
||||
eval result
|
||||
"""
|
||||
if feval != None:
|
||||
res = '[%d]' % it
|
||||
for dm, evname in evals:
|
||||
name, val = feval(bst.predict(dm), dm)
|
||||
res += '\t%s-%s:%f' % (evname, name, val)
|
||||
else:
|
||||
res = bst.eval_set(evals, it)
|
||||
|
||||
return res
|
||||
|
||||
|
||||
|
||||
def train(params, dtrain, num_boost_round = 10, evals = [], obj=None, feval=None):
|
||||
""" train a booster with given paramaters
|
||||
Args:
|
||||
@ -482,26 +475,69 @@ def train(params, dtrain, num_boost_round = 10, evals = [], obj=None, feval=None
|
||||
num of round to be boosted
|
||||
evals: list
|
||||
list of items to be evaluated
|
||||
obj:
|
||||
feval:
|
||||
obj: function
|
||||
cutomized objective function
|
||||
feval: function
|
||||
cutomized evaluation function
|
||||
"""
|
||||
bst = Booster(params, [dtrain]+[ d[0] for d in evals ] )
|
||||
if obj is None:
|
||||
for i in range(num_boost_round):
|
||||
bst.update( dtrain, i )
|
||||
if len(evals) != 0:
|
||||
sys.stderr.write(evaluate(bst, evals, i, feval).decode()+'\n')
|
||||
else:
|
||||
# try customized objective function
|
||||
for i in range(num_boost_round):
|
||||
pred = bst.predict( dtrain )
|
||||
grad, hess = obj( pred, dtrain )
|
||||
bst.boost( dtrain, grad, hess )
|
||||
if len(evals) != 0:
|
||||
sys.stderr.write(evaluate(bst, evals, i, feval)+'\n')
|
||||
for i in range(num_boost_round):
|
||||
bst.update( dtrain, i, obj )
|
||||
if len(evals) != 0:
|
||||
sys.stderr.write(bst.eval_set(evals, i, feval).decode()+'\n')
|
||||
return bst
|
||||
|
||||
def cv(params, dtrain, num_boost_round = 10, nfold=3, evals = [], obj=None, feval=None):
|
||||
class CVPack:
|
||||
def __init__(self, dtrain, dtest, param):
|
||||
self.dtrain = dtrain
|
||||
self.dtest = dtest
|
||||
self.watchlist = watchlist = [ (dtrain,'train'), (dtest, 'test') ]
|
||||
self.bst = Booster(param, [dtrain,dtest])
|
||||
def update(self, r, fobj):
|
||||
self.bst.update(self.dtrain, r, fobj)
|
||||
def eval(self, r, feval):
|
||||
return self.bst.eval_set(self.watchlist, r, feval)
|
||||
|
||||
def mknfold(dall, nfold, param, seed, evals=[], fpreproc = None):
|
||||
"""
|
||||
mk nfold list of cvpack from randidx
|
||||
"""
|
||||
np.random.seed(seed)
|
||||
randidx = np.random.permutation(dall.num_row())
|
||||
kstep = len(randidx) / nfold
|
||||
idset = [randidx[ (i*kstep) : min(len(randidx),(i+1)*kstep) ] for i in range(nfold)]
|
||||
ret = []
|
||||
for k in range(nfold):
|
||||
dtrain = dall.slice(np.concatenate([idset[i] for i in range(nfold) if k != i]))
|
||||
dtest = dall.slice(idset[k])
|
||||
# run preprocessing on the data set if needed
|
||||
if fpreproc is not None:
|
||||
dtrain, dtest, tparam = fpreproc(dtrain, dtest, param.copy())
|
||||
plst = tparam.items() + [('eval_metric', itm) for itm in evals]
|
||||
ret.append(CVPack(dtrain, dtest, plst))
|
||||
return ret
|
||||
|
||||
def aggcv(rlist):
|
||||
"""
|
||||
aggregate cross validation results
|
||||
"""
|
||||
cvmap = {}
|
||||
ret = rlist[0].split()[0]
|
||||
for line in rlist:
|
||||
arr = line.split()
|
||||
assert ret == arr[0]
|
||||
for it in arr[1:]:
|
||||
k, v = it.split(':')
|
||||
if k not in cvmap:
|
||||
cvmap[k] = []
|
||||
cvmap[k].append(float(v))
|
||||
for k, v in sorted(cvmap.items(), key = lambda x:x[0]):
|
||||
v = np.array(v)
|
||||
ret += '\t%s:%f+%f' % (k, np.mean(v), np.std(v))
|
||||
return ret
|
||||
|
||||
def cv(params, dtrain, num_boost_round = 10, nfold=3, eval_metric = [], \
|
||||
obj = None, feval = None, fpreproc = None):
|
||||
""" cross validation with given paramaters
|
||||
Args:
|
||||
params: dict
|
||||
@ -512,15 +548,16 @@ def cv(params, dtrain, num_boost_round = 10, nfold=3, evals = [], obj=None, feva
|
||||
num of round to be boosted
|
||||
nfold: int
|
||||
folds to do cv
|
||||
evals: list
|
||||
evals: list or
|
||||
list of items to be evaluated
|
||||
obj:
|
||||
feval:
|
||||
fpreproc: preprocessing function that takes dtrain, dtest,
|
||||
param and return transformed version of dtrain, dtest, param
|
||||
"""
|
||||
plst = list(params.items())+[('eval_metric', itm) for itm in evals]
|
||||
cvfolds = mknfold(dtrain, nfold, plst, 0)
|
||||
cvfolds = mknfold(dtrain, nfold, params, 0, eval_metric, fpreproc)
|
||||
for i in range(num_boost_round):
|
||||
for f in cvfolds:
|
||||
f.update(i)
|
||||
res = aggcv([f.eval(i) for f in cvfolds])
|
||||
f.update(i, obj)
|
||||
res = aggcv([f.eval(i, feval) for f in cvfolds])
|
||||
sys.stderr.write(res+'\n')
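The cv, mknfold, and aggcv additions above do not yet have a demo (the demo README marks cross validation as "to come"). A hedged sketch of how the new interface could be driven, assuming the agaricus data paths used elsewhere in this commit and a hypothetical fpreproc that rescales the positive-class weight per fold:

```python
import sys
sys.path.append('../../wrapper')   # same path assumption as the demos in this commit
import xgboost as xgb

dtrain = xgb.DMatrix('../data/agaricus.txt.train')
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}

def fpreproc(dtrain, dtest, param):
    # hypothetical per-fold preprocessing: set scale_pos_weight from the fold's label ratio
    label = dtrain.get_label()
    param['scale_pos_weight'] = float(len(label) - sum(label)) / sum(label)
    return dtrain, dtest, param

# 3-fold cross validation for 5 rounds; each round cv() writes the aggcv() summary
# (mean+std of every metric over the folds) to stderr
xgb.cv(param, dtrain, num_boost_round=5, nfold=3,
       eval_metric=['error'], fpreproc=fpreproc)
```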