Compare commits


687 Commits
v0.40 ... v0.47

Author SHA1 Message Date
Tianqi Chen
0dc68b1aef Update CHANGES.md 2016-01-14 15:58:02 -08:00
Yuan (Terry) Tang
98d8a8b871 Added contributor 2016-01-12 09:25:32 -06:00
Yuan (Terry) Tang
50af394272 Merge pull request #733 from damiencarol/javadocfix
[Java] Fix broken javadoc generation
2016-01-12 09:03:50 -06:00
damiencarol
375c106fcc Fix native/Native consistency in comments 2016-01-12 14:46:34 +01:00
Yuan (Terry) Tang
d1439a10a8 Update CONTRIBUTORS.md 2016-01-10 12:16:02 -06:00
Yuan (Terry) Tang
c44eb3ab91 Merge pull request #730 from ganesh-krishnan/master
Fixed off by 1 bug in early.stop.rounds in xgb.cv
2016-01-10 13:14:50 -05:00
damiencarol
fd3baf68f1 Fix warnings when generating javadoc 2016-01-09 15:53:35 +01:00
damiencarol
89216e239f Fix errors when generating javadoc 2016-01-09 15:45:13 +01:00
Ganesh
6ba53329e5 Fixed off by 1 bug in xgb.cv 2016-01-07 22:20:21 -08:00
Yuan (Terry) Tang
0958fb35ae Merge pull request #728 from yenchenlin1994/fix-doc-typo
Remove redundant word
2016-01-07 08:25:31 -06:00
YenChenLin
5a91ded214 Remove redundant word 2016-01-07 22:19:15 +08:00
Yuan (Terry) Tang
7606bf8156 Fixes #725 2016-01-06 18:21:29 -06:00
Yuan (Terry) Tang
1bd0f9eecd Merge pull request #724 from hxd1011/patch-1
Fix typo
2016-01-05 13:11:25 -06:00
hxd1011
8e9b7e2c67 Fix typo
"Until Know" to "Until Now"
2016-01-05 14:07:00 -05:00
Yuan (Terry) Tang
063bebe7d3 Merge pull request #722 from yenchenlin1994/fix-demo-regression-typo
Fix typo in demo/regression/README.md
2016-01-05 09:56:18 -06:00
YenChenLin
7ff704a13f Fix typo in demo/regression/README.md 2016-01-05 23:42:02 +08:00
Yuan (Terry) Tang
b684b5fada Merge pull request #720 from derek-damron/master
Add newline chars to early.stop.round message
2016-01-04 23:51:40 -06:00
Derek Damron
8756d5b160 Add newline char to early.stop.round message 2016-01-04 20:36:32 -08:00
Derek Damron
cd0099f2a1 Add newline char to early.stop.round message 2016-01-04 20:35:57 -08:00
Yuan (Terry) Tang
fa205cdaf8 Merge pull request #718 from kilojoules/patch-2
fix minor typo
2016-01-01 22:07:35 -06:00
Julian Quick
f51e1893fe fix minor typo 2016-01-01 20:03:45 -08:00
Tianqi Chen
da98e84b19 Merge pull request #714 from maarten-keijzer/doc_fix
Updated the documentation for 'gradient' and 'Hessian' (subscript error)
2015-12-30 11:55:14 +08:00
Maarten Keijzer
a6c35a8d74 Updated the documentation for 'gradient' and 'Hessian' (subscript error) 2015-12-29 15:28:43 +01:00
Yuan (Terry) Tang
d747649892 Merge pull request #712 from Far0n/py_cv
python cv bugfixing (eval metrics)
2015-12-29 07:30:26 -06:00
Yuan (Terry) Tang
ee8f189bba Merge pull request #713 from yanqingmen/java_wrapper
java wrapper modification
2015-12-29 07:18:19 -06:00
FrozenFingerz
177259a0a7 unittest for cv bugfixes added 2015-12-29 14:13:40 +01:00
yanqingmen
173ef11681 small change 2015-12-29 20:53:56 +08:00
yanqingmen
47d6d09081 add osx build instruction 2015-12-29 20:47:47 +08:00
yanqingmen
48c461ea85 change java_wrapper vs project name and script create_wrap 2015-12-29 19:50:40 +08:00
FrozenFingerz
2a46918c66 python cv bugfixing
- fixed a bug when both the eval_metric xgb param and the
metrics param of the cv function have been set
- cv early stopping output now looks like that of xgb.train
2015-12-29 12:24:38 +01:00
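A minimal sketch of the cross-validation call this fix concerns, assuming the xgb.cv interface of this era (params, dtrain, metrics, early_stopping_rounds); the data file and parameter values are illustrative only.

```python
import xgboost as xgb

# Illustrative data and parameters; the agaricus demo files ship with xgboost.
dtrain = xgb.DMatrix('agaricus.txt.train')
params = {'max_depth': 2, 'eta': 0.1, 'objective': 'binary:logistic',
          'eval_metric': 'logloss'}

# The fix above concerns the case where an eval_metric is set in params
# *and* the metrics argument of cv is given, and makes the early-stopping
# output of cv look like that of xgb.train.
res = xgb.cv(params, dtrain, num_boost_round=50, nfold=5,
             metrics=['error'], early_stopping_rounds=10, seed=0)
print(res)
```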
黄子轩
2db1673585 Merge branch 'dmlc-master' into java_wrapper 2015-12-29 01:35:51 -08:00
黄子轩
4a301240bd merge from dmlc/xgboost 2015-12-29 01:34:06 -08:00
黄子轩
91fedd85b0 modify jni code 2015-12-29 01:08:19 -08:00
Tianqi Chen
4f43f1d0ac Merge pull request #711 from yoavz/tree_boosting_doc_fix
minor latex typo fix in "Introduction to Boosted Trees" documentation
2015-12-29 08:15:32 +08:00
Yoav Zimmerman
d0ecb0cbc7 minor latex typo fix in Introduction to Boosted Trees documentation 2015-12-28 15:42:43 -08:00
Yuan (Terry) Tang
fcb7eaa555 Merge pull request #710 from Far0n/py_cv
python cv: fixed division by zero exception
2015-12-27 09:40:09 -06:00
FrozenFingerz
38b773d80b cv: fixed division by zero exception
- show_progress=False or show_progress=0 led to division by zero exception
2015-12-27 13:54:52 +01:00
Tianqi Chen
9f62553f23 Merge pull request #705 from elviswind/master
fix windows compile problem
2015-12-23 20:54:32 +08:00
junnan.wang@ef.com
dba782e985 fix windows compile problem 2015-12-23 14:33:00 +08:00
yanqingmen
4a456b2a75 small change for jni wrapper 2015-12-20 22:13:53 +08:00
huangzixuan
7d23ea7e9e add settings for OS X 2015-12-20 20:47:30 +08:00
Yuan (Terry) Tang
b942005931 Merge pull request #696 from Far0n/tc_fix
fixed wrong iter when using training continuation
2015-12-19 11:45:15 -05:00
yanqingmen
1456585249 refactor jni code and rename libxgboostjavawrapper.so to libxgboost4j.so 2015-12-19 22:26:40 +08:00
Faron
b3f3e7d0cb fixed wrong iter when using training continuation 2015-12-19 10:35:16 +01:00
Tianqi Chen
77434964ab Merge pull request #694 from khotilov/warnings_fixes
small fixes for make and gcc warnings
2015-12-19 06:50:40 +08:00
Vadim Khotilovich
f18852376f hopefully, this would make travis happy 2015-12-18 15:49:15 -06:00
Vadim Khotilovich
0c38a916fe make some gcc versions happy by using the fwrite return value 2015-12-18 15:03:39 -06:00
Vadim Khotilovich
f97c4ccb60 make gcc5 check silent when there's no gcc5 2015-12-18 14:34:16 -06:00
Vadim Khotilovich
d867579a69 make it possible to run create_wrap.sh not only from its directory 2015-12-18 14:18:28 -06:00
yanqingmen
f378fac6a1 Merge pull request #6 from dmlc/master
update
2015-12-18 14:24:08 +08:00
Yuan (Terry) Tang
4a15939c13 Merge pull request #690 from rcarneva/master
modifying cv show_progress to allow print-every-n behavior
2015-12-16 17:29:21 -06:00
Randy Carnevale
380e54a753 docstring typo 2015-12-16 17:25:55 -05:00
Randy Carnevale
0825ab36f0 updating docs for cv 2015-12-16 17:21:23 -05:00
Yuan (Terry) Tang
cfbf3595c7 Update CHANGES.md 2015-12-16 15:57:07 -06:00
Yuan (Terry) Tang
39751f8786 Merge pull request #668 from DexGroves/add-metadata
Expose model parameters to R
2015-12-16 15:55:54 -06:00
Randy Carnevale
a3fe14d6c6 modifying cv show_progress to allow print-every-n behavior 2015-12-16 16:33:01 -05:00
Groves
cd57ea2784 Add test that model parameters are accessible within R 2015-12-16 10:24:16 -06:00
Tianqi Chen
0b17caaa27 Merge pull request #688 from khotilov/cpp_spell_doc_fixes
Spelling, wording, and doc fixes in c++ code
2015-12-12 23:22:14 -05:00
Vadim Khotilovich
b47725a65b add Eclipse stuff to .gitignore 2015-12-12 21:45:41 -06:00
Vadim Khotilovich
c70022e6c4 spelling, wording, and doc fixes in c++ code
I was reading through the code and fixing some things in the comments.
Only a few trivial actual code changes were made to make things more
readable.
2015-12-12 21:40:12 -06:00
Yuan (Terry) Tang
c56c1b9482 Merge pull request #685 from ajkl/patch-16
adding right path to setup.py
2015-12-12 20:02:42 -05:00
Ajinkya Kale
0772b51c2c minor change dir 2015-12-12 16:34:07 -08:00
Ajinkya Kale
4695fa3c2a adding right path to setup.py 2015-12-12 15:08:59 -08:00
Yuan (Terry) Tang
7a74c9523a Merge pull request #683 from terrytangyuan/pylint
Pylint Fixes
2015-12-11 19:04:38 -06:00
terrytangyuan
0eb6240fd0 Fixed all lint errors 2015-12-11 18:46:15 -06:00
terrytangyuan
a7e79e089b fix lint errors in core 2015-12-11 18:37:13 -06:00
terrytangyuan
7be496a051 ignore nested blocks 2015-12-11 18:20:35 -06:00
terrytangyuan
5f2b2a6417 Re-enable py lint test 2015-12-11 18:13:14 -06:00
terrytangyuan
c3ec8ee76f Added pylintrc file 2015-12-11 18:10:15 -06:00
Michaël Benesty
5a49eb06ca Merge pull request #682 from pommedeterresautee/master
Wording #Rstat
2015-12-10 18:54:52 +01:00
Michaël Benesty
1b07f86eb8 wording fix 2015-12-10 11:33:40 +01:00
Michaël Benesty
b2e68b8dc7 New documentation rewording 2015-12-09 18:26:56 +01:00
Michaël Benesty
2d2f92631c Merge pull request #679 from pommedeterresautee/master
Wording of R doc in new functions
2015-12-08 21:45:17 +01:00
Michaël Benesty
f761432c11 Merge remote-tracking branch 'refs/remotes/dmlc/master' 2015-12-08 18:19:25 +01:00
Michaël Benesty
fbf2707561 Wording improvement 2015-12-08 18:18:51 +01:00
Yuan (Terry) Tang
a06410055c Merge pull request #678 from phunterlau/master
update pip building, troubleshooting, and potential sklearn import error
2015-12-08 06:06:05 -06:00
pommedeterresautee
ccd4b4be00 Merge branch 'master' of https://github.com/dmlc/xgboost 2015-12-08 11:22:23 +01:00
pommedeterresautee
855be97011 model dt tree function documentation improvement 2015-12-08 11:21:25 +01:00
phunterlau
a4840b0268 update pip building, troubleshooting with new makefile, plus a friendly error message when importing sklearn fails 2015-12-07 22:29:46 -08:00
Michaël Benesty
f3c5d9c1b6 Merge pull request #675 from pommedeterresautee/master
Generate new features based on tree leafs
2015-12-07 12:30:22 +01:00
Michaël Benesty
c1b2d9cb86 Generate new features based on tree leafs 2015-12-07 11:30:19 +01:00
Michaël Benesty
115c63bcde Merge remote-tracking branch 'refs/remotes/dmlc/master' 2015-12-07 11:04:46 +01:00
Yuan (Terry) Tang
162e91c5ca change .md to .rst 2015-12-06 20:25:53 -06:00
Michaël Benesty
14040123e8 Merge pull request #672 from derek-damron/patch-1
Update index.md
2015-12-07 00:08:26 +01:00
Derek Damron
ea883b30a5 Update index.md
Fixing a couple of spelling and grammatical errors.
2015-12-06 14:38:59 -08:00
Yuan (Terry) Tang
e25b2c4968 Remove redundant README 2015-12-06 11:05:44 -05:00
Michaël Benesty
3b67028ad6 remove intersect column in sparse Matrix 2015-12-05 19:02:05 +01:00
Michaël Benesty
4f4a5409d7 Merge remote-tracking branch 'refs/remotes/dmlc/master' 2015-12-05 18:30:09 +01:00
Yuan (Terry) Tang
88112f3d74 Added Apache License badge 2015-12-05 00:54:32 -05:00
Michaël Benesty
375192efa1 Merge pull request #670 from pommedeterresautee/master
Add code in demo to use the pred leaf in R
2015-12-04 16:35:43 +01:00
Michaël Benesty
2936378b76 Merge pull request #669 from dmoliveira/patch-1
Update README.md
2015-12-04 16:35:33 +01:00
pommedeterresautee
39fa45debe Add code to demo of leaf (show improvement in accuracy) 2015-12-04 15:16:58 +01:00
Diego Marinho de Oliveira
2557d81b3b Update README.md
The link for line 26 was wrong; it pointed again to the last demo. I was reading the README and found the subtle inconsistency. Please accept this minor change. It works correctly now.
2015-12-04 00:50:51 -02:00
Groves
91429bd63d Expose model parameters to R 2015-12-03 06:40:11 -06:00
pommedeterresautee
ff95d6d0ab Merge remote-tracking branch 'refs/remotes/dmlc/master' 2015-12-02 19:12:33 +01:00
Michaël Benesty
5473994a42 Merge pull request #667 from pommedeterresautee/master
change account information (pommedeterresautee)
2015-12-02 18:59:38 +01:00
Michaël Benesty
3c260c545d Merge pull request #666 from pommedeterresautee/master
Code cleaning + doc improvement #Rstat
2015-12-02 16:11:17 +01:00
pommedeterresautee
edca27fa32 Small rewording function xgb.importance 2015-12-02 15:48:22 +01:00
pommedeterresautee
db922e8c88 Small rewording function xgb.importance 2015-12-02 15:48:22 +01:00
pommedeterresautee
6ceb3438be Cleaning in documentation 2015-12-02 15:48:01 +01:00
pommedeterresautee
0abb4338a9 Cleaning in documentation 2015-12-02 15:48:01 +01:00
pommedeterresautee
7479cc68a7 Cleaning of demo 2015-12-02 15:47:45 +01:00
pommedeterresautee
e384f549f4 Cleaning of demo 2015-12-02 15:47:45 +01:00
pommedeterresautee
e57043ce62 Improve predict function documentation 2015-12-02 15:47:12 +01:00
pommedeterresautee
8233d589b6 Improve predict function documentation 2015-12-02 15:47:12 +01:00
Michaël Benesty
88e7c6012b Merge pull request #664 from pommedeterresautee/master
Support GLM in importance plot + increase tests #Rstat
2015-12-02 11:10:00 +01:00
Michaël Benesty
b708543309 Merge pull request #664 from pommedeterresautee/master
Support GLM in importance plot + increase tests #Rstat
2015-12-02 11:10:00 +01:00
pommedeterresautee
1678a6fbdb Increase cover of tests #Rstat 2015-12-02 10:40:15 +01:00
pommedeterresautee
45e6a6bbad Increase cover of tests #Rstat 2015-12-02 10:40:15 +01:00
pommedeterresautee
d04f7005de add support of GLM model in importance plot function 2015-12-02 10:39:57 +01:00
pommedeterresautee
43c860b6cc add support of GLM model in importance plot function 2015-12-02 10:39:57 +01:00
Bing Xu
5575257b08 Update README.md 2015-12-02 01:28:23 -07:00
Bing Xu
9a75daa388 Update README.md 2015-12-02 01:28:23 -07:00
Yuan (Terry) Tang
a1c0ee0e66 Merge pull request #644 from Far0n/verbose_eval_patch
small verbose_eval fixes
2015-12-01 14:58:58 -06:00
Yuan (Terry) Tang
811faa7bda Merge pull request #644 from Far0n/verbose_eval_patch
small verbose_eval fixes
2015-12-01 14:58:58 -06:00
Michaël Benesty
c870ef49da Merge pull request #662 from pommedeterresautee/master
Improve feature importance on GLM model
2015-12-01 19:02:18 +01:00
Michaël Benesty
bd2a4db26c Merge pull request #662 from pommedeterresautee/master
Improve feature importance on GLM model
2015-12-01 19:02:18 +01:00
pommedeterresautee
b05d5d3f24 Improve feature importance on GLM model 2015-12-01 18:44:25 +01:00
pommedeterresautee
28807733c3 Improve feature importance on GLM model 2015-12-01 18:44:25 +01:00
Michaël Benesty
423764ca2e Merge pull request #660 from pommedeterresautee/master
Polishing API + wording in function description #Rstat
2015-12-01 16:07:45 +01:00
Michaël Benesty
49ef81edb6 Merge pull request #660 from pommedeterresautee/master
Polishing API + wording in function description #Rstat
2015-12-01 16:07:45 +01:00
pommedeterresautee
6ce57d9cf8 Add new tests for helper functions 2015-12-01 15:44:27 +01:00
pommedeterresautee
29b73897f8 Add new tests for helper functions 2015-12-01 15:44:27 +01:00
Yuan (Terry) Tang
0ab719b59b Disable Python lint test temporarily 2015-12-01 08:39:25 -06:00
Yuan (Terry) Tang
de60db863b Disable Python lint test temporarily 2015-12-01 08:39:25 -06:00
pommedeterresautee
5d169afd7e Merge branch 'master' of https://github.com/dmlc/xgboost 2015-11-30 22:36:18 +01:00
pommedeterresautee
13a341b88d Merge branch 'master' of https://github.com/dmlc/xgboost 2015-11-30 22:36:18 +01:00
pommedeterresautee
8252d0d9f5 fix example 2015-11-30 16:33:33 +01:00
pommedeterresautee
b67902ebdd fix example 2015-11-30 16:33:33 +01:00
pommedeterresautee
2ca4016a1f fix relative to examples #Rstat 2015-11-30 16:21:43 +01:00
pommedeterresautee
425a5dd094 fix relative to examples #Rstat 2015-11-30 16:21:43 +01:00
pommedeterresautee
730bd72056 some fixes for Travis #Rstat 2015-11-30 15:47:10 +01:00
pommedeterresautee
6e370b90fd some fixes for Travis #Rstat 2015-11-30 15:47:10 +01:00
pommedeterresautee
c09c02300a Add new tests for new functions 2015-11-30 15:04:17 +01:00
pommedeterresautee
96c43cf197 Add new tests for new functions 2015-11-30 15:04:17 +01:00
pommedeterresautee
376ba6912e Update test to take care of API change 2015-11-30 14:08:27 +01:00
pommedeterresautee
ad8766dfa4 Update test to take care of API change 2015-11-30 14:08:27 +01:00
pommedeterresautee
476a6842ea Fix Rstat 2015-11-30 10:26:23 +01:00
pommedeterresautee
c5dedeb318 Fix Rstat 2015-11-30 10:26:23 +01:00
pommedeterresautee
07d62a4b89 Polishing API + wording in function description #Rstat 2015-11-30 10:22:14 +01:00
pommedeterresautee
84ab71dd7e Polishing API + wording in function description #Rstat 2015-11-30 10:22:14 +01:00
Michaël Benesty
bf19d821e0 Merge pull request #655 from pommedeterresautee/master
Add new multi tree plot function to R package
2015-11-28 18:08:27 +01:00
Michaël Benesty
09ed3f10cc Merge pull request #655 from pommedeterresautee/master
Add new multi tree plot function to R package
2015-11-28 18:08:27 +01:00
pommedeterresautee
28060d5595 Fix missing dependencies 2015-11-27 18:19:51 +01:00
pommedeterresautee
5e9f4dc973 Fix missing dependencies 2015-11-27 18:19:51 +01:00
pommedeterresautee
92e904dec9 add exclusion of global variables + generate Roxygen doc 2015-11-27 17:58:50 +01:00
pommedeterresautee
68b666d7e5 add exclusion of global variables + generate Roxygen doc 2015-11-27 17:58:50 +01:00
pommedeterresautee
2fc9dcc549 Improve description wording 2015-11-27 17:34:26 +01:00
pommedeterresautee
3d50a6a425 Improve description wording 2015-11-27 17:34:26 +01:00
pommedeterresautee
5169d08735 Add new multi.tree function to R package 2015-11-27 14:49:06 +01:00
pommedeterresautee
98ec6df168 Add new multi.tree function to R package 2015-11-27 14:49:06 +01:00
pommedeterresautee
e43830955f parameter names change in R function 2015-11-27 14:48:54 +01:00
pommedeterresautee
f28b7ed0cd parameter names change in R function 2015-11-27 14:48:54 +01:00
Michaël Benesty
9bc3d16599 Merge pull request #648 from pommedeterresautee/master
New function to plot model deepness
2015-11-24 13:52:40 +01:00
Michaël Benesty
1c4ed67779 Merge pull request #648 from pommedeterresautee/master
New function to plot model deepness
2015-11-24 13:52:40 +01:00
pommedeterresautee
6e9017c474 fix for Travis 2015-11-24 13:12:35 +01:00
pommedeterresautee
470ac2b46f fix for Travis 2015-11-24 13:12:35 +01:00
pommedeterresautee
485b30027f Plot model deepness
New function to explore the model by plotting the way splits are done.
2015-11-24 11:45:32 +01:00
pommedeterresautee
d9fe9c5d8a Plot model deepness
New function to explore the model by plotting the way splits are done.
2015-11-24 11:45:32 +01:00
Far0n
af166bf0a0 small verbose_eval fixes
- ensures same behavior for verbose_eval=0 and verbose_eval=False
- fix printing last eval message if early_stopping_rounds is set, but xgb
  runs to the end
2015-11-24 09:22:25 +01:00
tqchen
3a18b68f5f Merge commit '8ddffb36e1094e0fe3984e0eab132c23c58079a7' 2015-11-23 14:32:25 -08:00
tqchen
1b346d7041 Merge commit '8ddffb36e1094e0fe3984e0eab132c23c58079a7' 2015-11-23 14:32:25 -08:00
tqchen
8ddffb36e1 Squashed 'subtree/rabit/' changes from e81a11d..bed6320
bed6320 Merge pull request #26 from DrAndrey/master
291ab05 Remove redundant whitespace again
de25163 Remove redundant whitespace
3a6be65 Fix bug with name of sleep function

git-subtree-dir: subtree/rabit
git-subtree-split: bed63208af
2015-11-23 14:32:25 -08:00
Michaël Benesty
9cfe4bc6fe Merge pull request #647 from pommedeterresautee/master
Implement #431 PR
2015-11-23 19:47:08 +01:00
Michaël Benesty
311b1761c9 Merge pull request #647 from pommedeterresautee/master
Implement #431 PR
2015-11-23 19:47:08 +01:00
pommedeterresautee
60dd75745f Implement #431 PR 2015-11-23 18:19:59 +01:00
pommedeterresautee
fe7cdcefb4 Implement #431 PR 2015-11-23 18:19:59 +01:00
Yuan (Terry) Tang
13829329bd Merge pull request #639 from terrytangyuan/typo
Frequence to Frequency
2015-11-20 21:58:46 -06:00
terrytangyuan
51ee382517 Frequence to Frequency 2015-11-20 20:25:29 -06:00
Tianqi Chen
77fab79d83 Merge pull request #630 from sammthomson/docfix
grammar/style fixes for "Introduction to Boosted Trees" docs
2015-11-17 13:33:24 -08:00
Sam Thomson
2e9e6c82f9 grammar/style fixes for "Introduction to Boosted Trees" docs 2015-11-17 13:26:33 -08:00
Yuan (Terry) Tang
7e839c5c9e Merge pull request #627 from lenguyenthedat/patch-1
Updated build instructions for OS X.
2015-11-16 23:04:46 -06:00
Dat Le
bf50d25ea1 Updated build.md for OS X
OS X El Capitan does not seem to stably support the clang build anymore.
2015-11-16 10:28:12 +08:00
Yuan (Terry) Tang
83e61bf99e Merge pull request #621 from JohanManders/python-verbose-eval-extension
Python verbose_eval extension
2015-11-13 04:07:21 -06:00
Johan Manders
e68e9659ab Python verbose_eval extension
This is an extension of the verbose_eval capabilities.

Removed some trailing whitespace
2015-11-13 08:19:44 +01:00
Yuan (Terry) Tang
cb5171914e Merge pull request #623 from sinhrks/pandas_label
Cleanup pandas support
2015-11-12 18:04:29 -06:00
sinhrks
25c4fbd0cb Cleanup pandas support 2015-11-13 06:55:04 +09:00
Yuan (Terry) Tang
4fb6153eed Fixed minor lint issue 2015-11-12 09:01:05 -06:00
Yuan (Terry) Tang
a2216c12a0 Added recent changes 2015-11-12 08:57:38 -06:00
Yuan (Terry) Tang
0a0951ba12 Clarification for best_ntree_limit 2015-11-12 08:53:45 -06:00
Yuan (Terry) Tang
42e1fd8fff Merge pull request #598 from Far0n/py_train
best_ntree_limit attribute & training continuation bugfix
2015-11-12 06:16:19 -06:00
Yuan (Terry) Tang
309fb90a5d Merge pull request #618 from phunterlau/master
fix pushd problem of pip building, convert README to rst for PyPI
2015-11-12 06:11:07 -06:00
Faron
7f2628acd7 unittest for 'num_class > 2' added 2015-11-12 08:23:11 +01:00
phunterlau
ee4096d23e fix pushd problem of pip building, convert README to rst for PyPI 2015-11-11 23:03:07 -08:00
Yuan (Terry) Tang
7b3fd92015 Added PyPI badges 2015-11-10 18:23:39 -06:00
Far0n
ce5930c365 best_ntree_limit attribute added
- best_ntree_limit as new booster attribute added
- usage of bst.best_ntree_limit in python doc added
- fixed wrong 'best_iteration' after training continuation
2015-11-10 15:37:22 +01:00
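A minimal sketch of the usage the commit above describes: early stopping sets a best_ntree_limit attribute on the booster, which can then be passed to predict. Data files and parameter values are illustrative.

```python
import xgboost as xgb

dtrain = xgb.DMatrix('agaricus.txt.train')
dtest = xgb.DMatrix('agaricus.txt.test')
params = {'max_depth': 2, 'eta': 0.3, 'objective': 'binary:logistic'}
watchlist = [(dtest, 'eval'), (dtrain, 'train')]

# With early stopping, the returned booster carries best_iteration,
# best_score and (after this change) best_ntree_limit.
bst = xgb.train(params, dtrain, num_boost_round=100, evals=watchlist,
                early_stopping_rounds=10)

# Predict using only the trees up to the best iteration.
preds = bst.predict(dtest, ntree_limit=bst.best_ntree_limit)
```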
Yuan (Terry) Tang
f91ce704f3 Merge pull request #615 from antonymayi/master
python 2.6 compatibility tweak
2015-11-10 08:26:12 -06:00
antonymayi
8c7b18daed python 2.6 compatibility tweak
replacing set literal {} with set() for python 2.6 compatibility (plus reformatting the line)
2015-11-10 14:50:54 +01:00
Yuan (Terry) Tang
d1969b4c03 Update CHANGES.md 2015-11-09 18:13:44 -06:00
Yuan (Terry) Tang
1dd96b6cdc Merge pull request #597 from JohanManders/python-pandas-dtypes
Python pandas dtypes
2015-11-09 18:08:41 -06:00
Yuan (Terry) Tang
7491413de5 Merge pull request #611 from antonymayi/master
python 2.6 compatibility
2015-11-09 08:45:26 -06:00
antonymayi
7114d6681a Update training.py
pylint compliance
2015-11-09 15:09:14 +01:00
antonymayi
34e01642ca Update training.py
avoid dict comprehension for python 2.6 compatibility
2015-11-09 14:26:16 +01:00
Yuan (Terry) Tang
b8bc85b534 Clarification for learning_rates 2015-11-08 21:10:04 -06:00
Tong He
4db3dfee7d Update utils.R 2015-11-08 18:08:51 -08:00
Yuan (Terry) Tang
ae31bc21bc Merge pull request #610 from Far0n/master
grammar correction
2015-11-08 15:06:20 -05:00
Faron
b2f98db74e grammar correction 2015-11-08 21:00:16 +01:00
Yuan (Terry) Tang
bde25d6694 Added recent changes 2015-11-08 14:57:36 -05:00
Yuan (Terry) Tang
e837b339cc Reformat CHANGES.md 2015-11-08 14:54:52 -05:00
Yuan (Terry) Tang
01053f8f2f Merge pull request #594 from Far0n/feval
python: multiple eval_metrics changes
2015-11-08 10:10:28 -05:00
Yuan (Terry) Tang
8fc5693ef6 Merge pull request #609 from Far0n/cv_early_stopping_unittest
python: unittest for early stopping of cv
2015-11-08 09:59:18 -05:00
FrozenFingerz
3d36fa8f4e python: unittest for early stopping of cv 2015-11-08 11:42:57 +01:00
FrozenFingerz
b59018aa05 python: multiple eval_metrics changes
- allows feval to return a list of tuples (name, error/score value)
- changed behavior for multiple eval_metrics in conjunction with
early_stopping: Instead of raising an error, the last passed eval_metric
(or last entry in return value of feval) is used for early stopping
- allows list of eval_metrics in dict-typed params
- unittest for new features / behavior

documentation updated

- example for assigning a list to 'eval_metric'
- note about early stopping on last passed eval metric

- info msg for used eval metric added
2015-11-08 11:23:54 +01:00
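A minimal sketch of the behavior the commit above describes, assuming the interfaces it names: a list assigned to 'eval_metric' in dict-typed params, and a custom feval that may return a list of (name, value) tuples, with the last entry used for early stopping. Metric definitions and data are illustrative.

```python
import numpy as np
import xgboost as xgb

dtrain = xgb.DMatrix('agaricus.txt.train')
dtest = xgb.DMatrix('agaricus.txt.test')

# A list of metrics in dict-typed params (per this change).
params = {'max_depth': 2, 'eta': 0.3, 'objective': 'binary:logistic',
          'eval_metric': ['logloss', 'error']}

# A feval may now return a list of (name, value) tuples; the last one
# is used for early stopping.
def twin_eval(preds, dmat):
    labels = dmat.get_label()
    err = float(np.sum((preds > 0.5) != labels)) / len(labels)
    return [('my_error', err), ('my_error_x2', 2 * err)]

bst = xgb.train(params, dtrain, num_boost_round=20,
                evals=[(dtest, 'eval')], feval=twin_eval,
                early_stopping_rounds=5)
```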
Michaël Benesty
282a64c252 Merge pull request #608 from ClimbsRocks/patch-8
minor formatting update
2015-11-08 08:42:27 +01:00
Michaël Benesty
5268c19b6b Merge pull request #607 from ClimbsRocks/patch-7
punctuation update
2015-11-08 08:41:03 +01:00
Michaël Benesty
f5659e17d5 Merge pull request #605 from pommedeterresautee/master
Rewrite Viz function
2015-11-08 08:40:22 +01:00
Preston Parry
af047e9f8c minor formatting update 2015-11-07 22:32:18 -08:00
Preston Parry
d25efb6468 punctuation update 2015-11-07 22:27:39 -08:00
Yuan (Terry) Tang
ebbde5c343 Merge pull request #606 from cauldnz/patch-1
Fixing broken link for R sample.
2015-11-08 00:21:54 -05:00
Chris Auld
e74628f5d4 Update README.md
Fixed broken link for R 'First N Trees' sample.
2015-11-07 20:26:32 -08:00
unknown
7cb34e3ad6 Fix some bug + improve display + code clean 2015-11-07 22:24:37 +01:00
unknown
996645dc17 Change the way functions are called 2015-11-07 22:04:54 +01:00
unknown
77ae180d3d Remove DiagrammeR dependency to make travis happy... 2015-11-07 21:46:08 +01:00
unknown
0052b193cf Update lib version dependencies (for DiagrammeR mainly)
Fix @export tag in each R file (for Roxygen 5, otherwise it doesn't work anymore)
Regenerate Roxygen doc
2015-11-07 21:01:28 +01:00
unknown
635645c650 Rewrite tree plot function
Replace Mermaid by GraphViz
2015-11-07 21:00:02 +01:00
unknown
231a6e7aea Merge branch 'master' of https://github.com/pommedeterresautee/xgboost
# Conflicts:
#	R-package/R/xgb.model.dt.tree.R
2015-11-07 19:13:14 +01:00
Yuan (Terry) Tang
562fe8078b Added CV early stopping to CHANGES 2015-11-07 09:45:13 -05:00
Yuan (Terry) Tang
a3a4439dec Merge pull request #602 from Far0n/cv
early stopping for CV (python) issue #529
2015-11-07 09:42:54 -05:00
Faron
95cc900b1f early stopping for CV (python) 2015-11-07 09:52:36 +01:00
Yuan (Terry) Tang
190e58a8c6 Added test for maximize parameter 2015-11-04 22:25:10 -06:00
Johan Manders
5f0f8749d9 Cleaned up some code 2015-11-04 18:05:47 +01:00
Yuan (Terry) Tang
8bf6525394 Added PyPI badge to README 2015-11-04 09:19:40 -06:00
Dat Le
117f26f865 Updated build.md for OS X
Ref: https://github.com/dmlc/xgboost/issues/596
2015-11-04 13:54:56 +08:00
Johan Manders
b0f38e9352 Changed 4 tests
Changed symbol test to give error on < sign, not on = sign
Changed 3 other functions, so that float is used instead of q
2015-11-03 21:32:47 +01:00
Johan Manders
f9e1b2b7b7 Added back feature names 2015-11-03 21:26:11 +01:00
Johan Manders
96f221e0d0 Merge pull request #5 from dmlc/master
Update to latest version
2015-11-03 20:37:20 +01:00
Yuan (Terry) Tang
e436c94419 Create CHANGES.md 2015-11-03 08:32:52 -06:00
Yuan (Terry) Tang
deb802b2be Merge pull request #587 from Far0n/py_train
python training continuation & maximize parameter
2015-11-03 08:16:12 -06:00
Far0n
8e1adddc2b added unittest for training continuation 2015-11-03 14:44:17 +01:00
Far0n
b894f7c9d6 bugfix type-check xgb_model param 2015-11-03 14:43:08 +01:00
Yuan (Terry) Tang
a71ccd8372 Merge pull request #591 from terrytangyuan/test
More test coverage for Python package
2015-11-02 21:00:52 -06:00
terrytangyuan
7d297b418f Added more thorough test for early stopping (+1 squashed commit)
Squashed commits:
[4f78cc0] Added test for early stopping (+1 squashed commit)
2015-11-02 20:37:27 -06:00
terrytangyuan
166e878830 Added tests for additional params in sklearn wrapper (+1 squashed commit)
Squashed commits:
[43892b9] Added tests for additional params in sklearn wrapper
2015-11-02 19:54:36 -06:00
Yuan (Terry) Tang
430be8d4bd Merge pull request #589 from Far0n/patch-1
Update CONTRIBUTORS.md
2015-11-02 14:52:25 -06:00
Far0n
8676a1bf56 Update CONTRIBUTORS.md 2015-11-02 21:27:05 +01:00
Faron
4fe2f2fb09 python train additions
+ training continuation of existing model
+ maximize parameter just like in R package (whether to maximize feval)
2015-11-02 21:21:05 +01:00
Yuan (Terry) Tang
7f559235be Merge pull request #586 from Far0n/sklearn_wrapper
sklearn_wrapper additions fixed #420
2015-11-02 12:07:12 -06:00
Faron
79813097b5 sklearn_wrapper additions
added output_margin & ntree_limit to predict and predict_proba
2015-11-02 17:41:30 +01:00
Yuan (Terry) Tang
e49d06c6bd Merge pull request #585 from phunterlau/master
separate setup.py from pip installation, add troubleshooting page
2015-11-02 09:45:20 -06:00
phunterlau
739b3f2c5f separate setup.py with pip installation, add troubleshooting page 2015-11-01 22:11:11 -08:00
Yuan (Terry) Tang
9e1690defe Merge pull request #582 from terrytangyuan/test
Test (eta decay) and bug fix
2015-10-31 13:07:33 -04:00
terrytangyuan
610b70b79e Suppress more evaluation verbose during training 2015-10-31 13:05:52 -04:00
terrytangyuan
15a0d27eed Fixed bug in eta decay (+2 squashed commits)
Squashed commits:
[b67caf2] Fix build
[365ceaa] Fixed bug in eta decay
2015-10-31 12:54:27 -04:00
terrytangyuan
888edba03f Added test for eta decay (+3 squashed commits)
Squashed commits:
[9109887] Added test for eta decay(+1 squashed commit)
Squashed commits:
[1336bd4] Added tests for eta decay (+2 squashed commit)
Squashed commits:
[91aac2d] Added tests for eta decay (+1 squashed commit)
Squashed commits:
[3ff48e7] Added test for eta decay
[6bb1eed] Rewrote Rd files
[bf0dec4] Added learning_rates for diff eta in each boosting round
2015-10-31 12:36:29 -04:00
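A minimal sketch of per-round learning rates as added in this range (PR #563), assuming the learning_rates argument to xgb.train accepts a list with one eta per boosting round; all values and the data file are illustrative.

```python
import xgboost as xgb

dtrain = xgb.DMatrix('agaricus.txt.train')
params = {'max_depth': 2, 'objective': 'binary:logistic'}

num_round = 4
# One eta per boosting round; a callable form may also be accepted.
rates = [0.3, 0.2, 0.1, 0.05]

bst = xgb.train(params, dtrain, num_boost_round=num_round,
                learning_rates=rates)
```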
terrytangyuan
c817efbd8a Fix Travis build 2015-10-30 23:41:24 -04:00
terrytangyuan
c11d6d5929 Merge branch 'master' of https://github.com/dmlc/xgboost 2015-10-30 23:01:44 -04:00
Yuan (Terry) Tang
243fd46df9 Merge pull request #581 from ThunderShiviah/patch-1
Fix minor spelling and grammar
2015-10-30 21:55:29 -04:00
Thunder Shiviah
a0c9ecd289 Fix minor spelling errors and awkward grammar. 2015-10-30 18:43:31 -07:00
terrytangyuan
e23f4ec3db Minor addition to R unit tests 2015-10-30 19:48:00 -05:00
Yuan (Terry) Tang
9cdcc8303b Update CHANGES.md 2015-10-30 10:54:29 -05:00
Yuan (Terry) Tang
c16a6222f3 Merge pull request #563 from Far0n/eta_decay
learning_rates per boosting round
2015-10-30 10:21:33 -05:00
Tianqi Chen
3e648fd1e9 Merge pull request #572 from ghosthugger/master
install xgboost so it can be imported
2015-10-29 10:59:28 -07:00
Yuan (Terry) Tang
b9a9cd9db8 Merge pull request #580 from terrytangyuan/test
Fixed most of the lint issues
2015-10-29 00:54:16 -04:00
terrytangyuan
5b9e071c18 Fix travis build (+1 squashed commit)
Squashed commits:
[9240d5f] Fix Travis build
2015-10-29 00:28:53 -04:00
Yuan (Terry) Tang
99157ae56a Merge pull request #579 from ClimbsRocks/patch-4
minor wording update
2015-10-28 23:25:17 -04:00
terrytangyuan
6024480400 Fixed most of the lint issues 2015-10-28 23:24:17 -04:00
Preston Parry
6d35bd2421 minor wording update
just clarifying some of the language describing the parameters
2015-10-28 20:10:21 -07:00
terrytangyuan
8bae715994 Lint fix on infix operators 2015-10-28 23:04:45 -04:00
Yuan (Terry) Tang
1dcedb23ec Update CONTRIBUTORS.md 2015-10-28 22:57:41 -04:00
terrytangyuan
d7fce99564 Lint fix on consistent assignment 2015-10-28 22:22:51 -04:00
Michaël Benesty
ce9d7045f9 Merge pull request #575 from ClimbsRocks/patch-2
Clarifies explanations around Data Interface code
2015-10-28 10:02:27 +01:00
Michaël Benesty
1924e16f45 Merge pull request #576 from ClimbsRocks/patch-3
fixes typo in error message
2015-10-28 10:00:54 +01:00
Preston Parry
b3bb54da73 fixes typo in error message 2015-10-27 23:34:50 -07:00
Tianqi Chen
88b4c64c0d Merge pull request #573 from ClimbsRocks/patch-1
Clarifies wording on Data Interface intro list
2015-10-27 23:01:10 -07:00
Preston Parry
89eafa1b97 Clarifies explanations around Data Interface code 2015-10-27 22:41:29 -07:00
Preston Parry
8ddb7b0152 Clarifies wording on Data Interface intro list 2015-10-27 22:35:35 -07:00
Gösta Forsum
111b04e18e Update setup.py 2015-10-27 13:47:58 +01:00
Tong He
2e31e97e54 Merge pull request #568 from terrytangyuan/test
Added test_lint.R to test code quality
2015-10-26 22:19:48 -07:00
terrytangyuan
56da375165 Added test_lint.R to test code quality 2015-10-25 20:45:04 -04:00
Tianqi Chen
3534147905 Merge pull request #564 from Far0n/sklearn_wrapper
added missing params to sklearn python wrapper
2015-10-25 12:42:08 -07:00
Faron
738e420128 correcting wrong default values 2015-10-25 11:26:33 +01:00
Faron
b80d5d6b33 fixed too long lines 2015-10-25 11:17:35 +01:00
Faron
422febd18e added missing params 2015-10-25 10:58:07 +01:00
Faron
68c9252ff7 fixed "Exactly one space required after comma" 2015-10-25 10:20:00 +01:00
Faron
a1ba608641 learning_rates per boosting round 2015-10-25 10:00:20 +01:00
Tong He
224f574420 Merge pull request #561 from terrytangyuan/test
Added test for code quality check
2015-10-24 22:27:19 -07:00
Tianqi Chen
06f502a1aa Merge pull request #549 from phunterlau/master
Fix data file shipping confusions on pip install for #463
2015-10-24 22:08:59 -07:00
Tianqi Chen
d60ee84137 Merge pull request #560 from sinhrks/plot_importance
Python: adjusts plot_importance ylim
2015-10-24 22:08:40 -07:00
terrytangyuan
139feaf97a Code: Lint fixes on trailing spaces 2015-10-24 16:50:03 -04:00
terrytangyuan
537b34dc6f Code: Some Lint fixes 2015-10-24 16:43:44 -04:00
terrytangyuan
3abbd7b4c7 Added test_lint to test code quality 2015-10-24 16:39:58 -04:00
sinhrks
1f19b78287 Python: adjusts plot_importance ylim 2015-10-25 03:16:53 +09:00
Tianqi Chen
36927632c5 Merge pull request #557 from shimo-t/patch
fix training.py and sklearn.py for evals_result in python3
2015-10-23 09:55:50 -07:00
Takahisa Shimoda
607599f2a1 fix sklearn.py for evals_result in python3 2015-10-23 05:40:31 +09:00
Takahisa Shimoda
b587dd2704 fix training.py for evals_result in python3 2015-10-23 05:37:13 +09:00
Tianqi Chen
4b4ade8342 Update CONTRIBUTORS.md 2015-10-22 08:40:36 -07:00
Tianqi Chen
d4d36eed45 Merge pull request #528 from terrytangyuan/test
More Unit Tests for Python Package
2015-10-22 08:39:32 -07:00
Tianqi Chen
cb7f331ebc Merge pull request #555 from sinhrks/plot_sklearn
Allow plot function to handle XGBModel
2015-10-22 08:39:25 -07:00
Tianqi Chen
c4181e5f2e Merge pull request #552 from yoori/perf
GBTree::Predict performance fix: removed excess thread_temp initializ…
2015-10-22 08:39:05 -07:00
terrytangyuan
ec2cdafec5 Added fixed random seed for tests (+1 squashed commit)
Squashed commits:
[76e3664] Added fixed random seed for tests
2015-10-21 23:38:41 -05:00
terrytangyuan
755072e378 Fix failed tests (+2 squashed commits)
Squashed commits:
[962e1e4] Fix failed tests
[21ca3fb] Removed one unnecessary line
2015-10-21 23:15:34 -05:00
terrytangyuan
652ff07668 Added scikit-learn from Conda 2015-10-21 21:30:11 -05:00
phunterlau
24a92808db correct print for python 3 2015-10-21 14:32:35 -07:00
sinhrks
6f046327ac Allow plot function to handle XGBModel 2015-10-22 01:00:54 +09:00
tqchen
eee3046624 [DOC] Add contributor 2015-10-20 19:44:06 -07:00
tqchen
a16289b204 Squashed 'subtree/rabit/' changes from fa99857..e81a11d
e81a11d Merge pull request #25 from daiyl0320/master
35c3b37 add retry mechanism to ConnectTracker and modify Listen backlog to 128 in rabit_traker.py
c71ed6f try deply doxygen
62e5647 try deply doxygen
732f1c6 try
2fa6e02 ok
0537665 minor
7b59dcb minor
5934950 new doc
f538187 ok
44b6049 new doc
387339b add more
9d4397a chg
2879a48 chg
30e3110 ok
9ff0301 add link translation
6b629c2 k
32e1955 ok
8f4839d fix
93137b2 ok
7eeeb79 reload recommonmark
a8f00cc minor
19b0f01 ok
dd01184 minor
c1cdc19 minor
fcf0f43 try rst
cbc21ae try
62ddfa7 tiny
aefc05c final change
2aee9b4 minor
fe4e7c2 ok
8001983 change to subtitle
5ca33e4 ok
88f7d24 update guide
29d43ab add code
fe8bb3b minor hack for readthedocs
229c71d Merge branch 'master' of ssh://github.com/dmlc/rabit
7424218 ok
d1d45bb Update README.md
1e8813f Update README.md
1ccc990 Update README.md
0323e06 remove readme
679a835 remove theme
7ea5b7c remove numpydoc to napoleon
b73e2be Merge branch 'master' of ssh://github.com/dmlc/rabit
1742283 ok
1838e25 Update python-requirements.txt
bc4e957 ok
fba6fc2 ok
0251101 ok
d50b905 ok
d4f2509 ok
cdf401a ok
fef0ef2 new doc
cef360d ok
c125d2a ok
270a49e add requirments
744f901 get the basic doc
1cb5cad Merge branch 'master' of ssh://github.com/dmlc/rabit
8cc07ba minor
d74f126 Update .travis.yml
52b3dcd Update .travis.yml
099581b Update .travis.yml
1258046 Update .travis.yml
7addac9 Update Makefile
0ea7adf Update .travis.yml
f858856 Update travis_script.sh
d8eac4a Update README.md
3cc49ad lint and travis
ceedf4e fix
fd8920c fix win32
8bbed35 modify
9520b90 Merge pull request #14 from dmlc/hjk41
df14bb1 fix type
f441dc7 replace tab with blankspace
2467942 remove unnecessary include
181ef47 defined long long and ulonglong
1582180 use int32_t to define int and int64_t to define long. in VC long is 32bit
e0b7da0 fix

git-subtree-dir: subtree/rabit
git-subtree-split: e81a11dd7e
2015-10-20 19:37:47 -07:00
tqchen
a4ac750eb1 Merge commit 'a16289b2047a7c2ec36667f6031dbb648e4d2caa' 2015-10-20 19:37:47 -07:00
yoori
981f06b9d1 style fix 2015-10-20 00:58:11 +04:00
yoori
49c1cb6990 GBTree::Predict performance fix: removed excess thread_temp initialization 2015-10-20 00:52:37 +04:00
yoori
c0853967d5 GBTree::Predict performance fix: removed excess thread_temp initialization 2015-10-20 00:06:00 +04:00
Tianqi Chen
fd8439ffbc Update param.h
enforce parallel option to 0 for now for stable result
2015-10-19 08:59:06 -07:00
Johan Manders
7c79c9ac3a Bool gets mapped to i instead of int 2015-10-19 17:36:57 +02:00
phunterlau
8ad58139cd fix pylint warnings 2015-10-18 18:55:15 -07:00
phunterlau
7b25834667 fix data file shipping confusions, force system compiling, correct libpath for pip 2015-10-18 17:28:07 -07:00
Johan Manders
66b9a72d5a Merge pull request #4 from JohanManders/JohanManders-Pandas
More Pandas dtypes and more flexible variable naming
2015-10-17 15:17:16 +02:00
Johan Manders
9bbc3901ee More Pandas dtypes and more flexible variable naming
- Pandas DataFrame supports more dtypes than 'int64', 'float64' and 'bool', therefore added a bunch of extra dtypes for the data variable.
- From now on the label variable can be a Pandas DataFrame with the same dtypes as the data variable.
- If label is a Pandas DataFrame, it will be converted to float.
- If no feature_types is set, the data dtypes will be converted to 'int' or 'float'.
- The feature_names may contain every character except [, ] or <
2015-10-17 15:13:42 +02:00
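A minimal sketch of constructing a DMatrix from a pandas DataFrame, which the pandas support commits in this range add; the column names and dtypes are illustrative, and per the note above feature names must not contain '[', ']' or '<'.

```python
import pandas as pd
import xgboost as xgb

# Illustrative frame mixing dtypes that the pandas support handles.
df = pd.DataFrame({
    'f0': pd.Series([1, 2, 3], dtype='int32'),
    'f1': pd.Series([0.5, 1.5, 2.5], dtype='float64'),
    'f2': pd.Series([True, False, True], dtype='bool'),
})
# Per the commit above, the label may itself be a DataFrame (converted to float).
label = pd.DataFrame({'y': [0, 1, 1]})

dtrain = xgb.DMatrix(df, label=label)
print(dtrain.feature_names, dtrain.feature_types)
```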
Johan Manders
f116722e68 Merge pull request #3 from dmlc/master
Getting latest version from dmlc
2015-10-17 14:41:13 +02:00
Tianqi Chen
8e4dc43368 Merge pull request #540 from JohanManders/quansie-python-training-patch-1
Update training.py and sklearn.py for evals_result
2015-10-16 20:42:29 -07:00
Johan Manders
00387cb645 Removed the last few trailing whitespaces 2015-10-14 14:26:18 +02:00
Johan Manders
0f8f8e05b2 One line was too long 2015-10-14 14:18:31 +02:00
Johan Manders
82c2ba4c44 Removed trailing whitespaces and Change Error to XGBoostError 2015-10-14 14:17:57 +02:00
Johan Manders
edf4595bc1 Added evals result demos 2015-10-14 13:45:59 +02:00
Johan Manders
f1e1cc28ff Access xgboost eval metrics by using sklearn 2015-10-14 13:43:14 +02:00
Johan Manders
122ec48a89 Update evals_result.py 2015-10-14 13:40:20 +02:00
Johan Manders
6e2bdcbbbc Demo for accessing eval metrics in xgboost 2015-10-14 13:22:39 +02:00
Johan Manders
67f3c687b8 Added Johan Manders to the list, asked by Tianqi Chen 2015-10-14 13:06:14 +02:00
Johan Manders
9c8420a4dc Updated the documentation a bit
Will upload some demos for guide-python later.
2015-10-14 12:53:42 +02:00
Johan Manders
e960a09ff4 Made eval_results for sklearn output the same structure as in the new training.py
Changed the name of eval_results to evals_result, so that the naming is the same in training.py and sklearn.py

Made the structure of evals_result the same as in training.py, the names of the keys are different:

In sklearn.py you cannot name your evals_result, but they are automatically called 'validation_0', 'validation_1' etc.
The dict evals_result will output something like: {'validation_0': {'logloss': ['0.674800', '0.657121']}, 'validation_1': {'logloss': ['0.63776', '0.58372']}}

In training.py you can name your multiple evals_result with a watchlist like: watchlist  = [(dtest,'eval'), (dtrain,'train')]
The dict evals_result will output something like: {'train': {'logloss': ['0.68495', '0.67691']}, 'eval': {'logloss': ['0.684877', '0.676767']}}

You can access the evals_result using the evals_result() function.
2015-10-14 12:51:46 +02:00
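A minimal sketch of reading evaluation history through the sklearn wrapper as described above; the toy data is illustrative, and the eval-set keys 'validation_0', 'validation_1' are assigned automatically.

```python
import numpy as np
import xgboost as xgb

# Illustrative toy data.
X = np.random.rand(100, 5)
y = (X[:, 0] > 0.5).astype(int)
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

clf = xgb.XGBClassifier(n_estimators=10)
clf.fit(X_train, y_train, eval_metric='logloss',
        eval_set=[(X_train, y_train), (X_test, y_test)])

# Keys are assigned automatically: 'validation_0', 'validation_1', ...
history = clf.evals_result()
print(history['validation_1']['logloss'])
```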
Johan Manders
e339cdec52 Too many branches and unused key 2015-10-12 16:47:24 +02:00
Johan Manders
40566cdbba update sklearn.py because evals_result in training.py changed
Because I changed the training.py, the sklearn.py also had to be changed to be able to read all the data from evals_result.
2015-10-12 16:31:23 +02:00
quansie
30d0d5fb96 Merge pull request #2 from quansie/quansie-python-training-patch-1
Removed extra spaces
2015-10-12 14:28:50 +02:00
quansie
b758a13813 Removed extra spaces 2015-10-12 14:26:23 +02:00
quansie
541580d157 Update training.py 2015-10-12 14:19:25 +02:00
quansie
8a484e990e Merge pull request #1 from quansie/quansie-python-training-patch-1
training.py - pass all eval_metric information to evals_result
2015-10-12 14:11:34 +02:00
quansie
1ca737ed55 Update training.py
Made changes to training.py to make sure all eval_metric information get passed to evals_result. Previous version lost and mislabeled data in evals_result when using more than one eval_metric.

Structure of eval_metric is now:
eval_metric[evals][eval_metric] = list of metrics

Example:

>>> dtrain = xgb.DMatrix('agaricus.txt.train', silent=True)
>>> dtest = xgb.DMatrix('agaricus.txt.test', silent=True)

>>> param = [('max_depth', 2), ('objective', 'binary:logistic'), ('bst:eta', 0.01), ('eval_metric', 'logloss'), ('eval_metric', 'error')]

>>> watchlist  = [(dtest,'eval'), (dtrain,'train')]
>>> num_round = 3
>>> evals_result = {}
>>> bst = xgb.train(param, dtrain, num_round, watchlist, evals_result=evals_result)

>>> print(evals_result['eval']['logloss'])
>>> print(evals_result)

Prints:

['0.684877', '0.676767', '0.668817']

{'train': {'logloss': ['0.684954', '0.676917', '0.669036'], 'error': ['0.04652', '0.04652', '0.04652']}, 'eval': {'logloss': ['0.684877', '0.676767', '0.668817'], 'error': ['0.042831', '0.042831', '0.042831']}}
2015-10-11 01:09:05 +02:00
Tong He
e9edb03eff Merge pull request #533 from kferris10/master
Switch default missing values from 0 to NA in R package
2015-10-08 10:47:28 -07:00
kferris
d5a34339e5 Updated Changes 2015-10-08 13:22:23 -04:00
kferris
32ca060094 Fix merge conflicts 2015-10-08 08:58:27 -04:00
Tong He
81d4d4d2c1 Update utils.R 2015-10-07 18:26:33 -07:00
kferris
7a94bdb60c Switch missing values from 0 to NA in R package 2015-10-07 18:51:47 -04:00
yanqingmen
3453b6e715 Merge pull request #5 from dmlc/master
update from dmlc/xgboost
2015-10-07 13:55:57 +08:00
terrytangyuan
1080dc256a Fix Travis build 2015-10-05 00:46:56 -05:00
terrytangyuan
fc5036a630 Deleted redundant blank lines 2015-10-04 23:29:40 -05:00
terrytangyuan
9d627e2567 DOC: Updated contributors.md 2015-10-04 23:26:46 -05:00
terrytangyuan
5dd23a2195 TST: Added test for parameter tuning using GridSearchCV 2015-10-04 23:16:00 -05:00
terrytangyuan
956e50686e TST: Added test for early stopping 2015-10-04 23:15:25 -05:00
terrytangyuan
412310ed04 Added test for regression using Boston Housing dataset 2015-10-04 23:04:23 -05:00
terrytangyuan
d20bfb12e4 Added assertions for classification tests 2015-10-04 23:01:07 -05:00
terrytangyuan
3dbd4af263 TST: Added tests for multi-class classification 2015-10-04 22:57:13 -05:00
terrytangyuan
7b9b4f821b TST: Added tests for binary classification 2015-10-04 22:53:31 -05:00
terrytangyuan
1411d3f37f TST: Added test for custom_objective function in cv 2015-10-04 22:45:10 -05:00
terrytangyuan
dfb89e3442 TST: Added test for show_stdv when using cv 2015-10-04 22:42:39 -05:00
terrytangyuan
0c360fe55f TST: Added test for fpreproc 2015-10-04 22:30:45 -05:00
Tianqi Chen
3109069019 Merge pull request #525 from sinhrks/df_columns
Python supports pd.DataFrame with non-str columns
2015-10-04 10:01:09 -07:00
sinhrks
dbcb4c8729 Support non-str column names 2015-10-04 13:30:01 +09:00
Tianqi Chen
2859c190cd Merge pull request #522 from sinhrks/pandas
python DMatrix now accepts pandas DataFrame
2015-10-02 10:19:14 -07:00
Tianqi Chen
9c39f69559 Merge pull request #524 from sinhrks/cv_pandas
Python CV returns pd.DataFrame or np.ndarray
2015-10-02 10:18:13 -07:00
sinhrks
b958c55ac6 CV returns ndarray or DataFrame 2015-10-02 22:38:03 +09:00
sinhrks
b943becc61 python DMatrix now accepts pandas DataFrame 2015-10-01 22:52:32 +09:00
Tianqi Chen
db490d1c75 Merge pull request #503 from sinhrks/feature_types
Python: Add feature_types to DMatrix
2015-09-29 14:14:48 -07:00
sinhrks
f6f3473d17 Change to properties 2015-09-28 22:36:39 +09:00
sinhrks
db692a30e5 Add feature_types 2015-09-28 22:25:35 +09:00
Tianqi Chen
b0591c8042 Merge pull request #514 from nerdcha/master
Fix makefile typo
2015-09-21 15:05:20 -07:00
Jamie Hall
f5920f8cbd Fix makefile typo 2015-09-22 07:18:15 +10:00
Tianqi Chen
05b242d542 Merge pull request #511 from nerdcha/master
Use homebrew gcc if available
2015-09-20 17:18:38 -07:00
Jamie Hall
6c3e4d7d0d Use homebrew gcc if available 2015-09-21 08:55:42 +10:00
Tianqi Chen
f28459497d fix pylint in setup 2015-09-18 20:22:54 -07:00
Tianqi Chen
e558d45208 Update .travis.yml 2015-09-18 18:45:18 -07:00
Tianqi Chen
788741bbcb Merge pull request #507 from nerdcha/master
Restore Python3 compatibility
2015-09-18 18:32:29 -07:00
Jamie Hall
0bca4c8c3b Restore Python3 compatibility 2015-09-19 10:46:57 +10:00
Tianqi Chen
5ff0fcc693 Merge pull request #504 from irachex/contributor
Add contributor
2015-09-17 19:38:22 -07:00
Huayi Zhang
c49c6565e5 Add contributor 2015-09-18 10:35:41 +08:00
Tianqi Chen
a92d21ce24 Merge pull request #502 from irachex/fix_setup
Fix python setup: avoid import numpy in setup.py
2015-09-17 09:35:46 -07:00
Tianqi Chen
808c0a6dff Merge pull request #497 from sinhrks/numpy_check
Bug: Fix numpy array check logic
2015-09-17 09:19:58 -07:00
sinhrks
f7d434aec2 Fix numpy array check logic 2015-09-17 22:51:44 +09:00
Huayi Zhang
6af98bec16 Fix python setup: avoid import numpy in setup.py
Currently `pip install xgboost` will raise a traceback like this

```
Traceback (most recent call last):
  File "<string>", line 20, in <module>
  File "/tmp/pip-build-IAdqYE/xgboost/setup.py", line 20, in <module>
    import xgboost
  File "./xgboost/__init__.py", line 8, in <module>
    from .core import DMatrix, Booster
  File "./xgboost/core.py", line 12, in <module>
    import numpy as np
ImportError: No module named numpy
```

We should avoid importing numpy in setup.py and let pip install numpy and scipy automatically.
That's what `install_requires` is for.
2015-09-17 14:49:19 +08:00
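A minimal sketch of the approach the commit above suggests: declaring numpy and scipy in install_requires so pip resolves them, instead of setup.py importing the package (and hence numpy) at build time. The metadata values below are placeholders, not the project's actual setup.py.

```python
# setup.py (sketch): let pip install the numeric dependencies rather than
# importing the package at setup time.
from setuptools import setup

setup(
    name='xgboost',
    version='0.0',            # placeholder
    packages=['xgboost'],
    install_requires=[
        'numpy',
        'scipy',
    ],
)
```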
Tianqi Chen
cf2ec238a4 Merge pull request #496 from sinhrks/str_cln
Cleanup str roundtrip using ctypes
2015-09-16 16:01:42 -07:00
sinhrks
bb6b7ded55 Cleanup str roundtrip using ctypes 2015-09-17 04:10:19 +09:00
Tianqi Chen
bad4a27b9f Merge pull request #495 from aeeilllmrx/master
minor spelling and grammar changes
2015-09-16 08:40:51 -07:00
Tianqi Chen
f5eb345c8a Merge pull request #498 from sinhrks/check_binary
BUG: incorrect model_file results in segfault
2015-09-16 08:40:11 -07:00
sinhrks
db0c9e1c2d BUG: incorrect model_file results in segfault 2015-09-16 22:02:30 +09:00
Alex Miller
0b143e6d22 spelling changes 2015-09-16 01:39:01 -07:00
Alex Miller
7f3bc03990 spelling and grammar 2015-09-16 01:33:28 -07:00
Alex Miller
1f624a8005 Merge pull request #2 from aeeilllmrx/aeeilllmrx-spelling-and-grammar
spelling and grammar changes
2015-09-16 01:32:43 -07:00
Alex Miller
030a4e4e25 spelling and grammar changes 2015-09-16 01:23:31 -07:00
Alex Miller
16781ac8f9 Merge pull request #1 from dmlc/master
update from original
2015-09-16 01:16:31 -07:00
Tianqi Chen
ae43fd7c7a Merge pull request #488 from sinhrks/pyfeaturenames
Support feature names in Python package
2015-09-15 09:56:55 -07:00
sinhrks
6063d243eb Mac build fix 2015-09-15 18:39:06 +09:00
Tianqi Chen
bda3282f6d Merge pull request #492 from Far0n/patch-1
bugfix evals_result regex
2015-09-14 08:46:58 -07:00
sinhrks
48ac946d9f Use ctypes 2015-09-14 22:12:19 +09:00
Far0n
0406c64a5d bugfix evals_result regex 2015-09-14 11:25:41 +02:00
Tianqi Chen
b1c94c7d86 Merge pull request #490 from phunterlau/master
add static link to gcc + openmp for MAC
2015-09-13 18:06:28 -07:00
phunterlau
529b80406c switch back to dynamic build by default 2015-09-13 17:36:49 -07:00
phunterlau
13c8d2ba74 add multi-thread static link for MAC 2015-09-13 17:34:37 -07:00
Hongliang Liu
cbb52b1d5d Merge pull request #2 from dmlc/master
rebase to current dmlc official version
2015-09-13 15:01:22 -07:00
sinhrks
6506a1c490 ENH: allow python to handle feature names 2015-09-12 12:37:33 +09:00
Tong He
dd3126735b Merge pull request #482 from terrytangyuan/patch-1
Added xgboost demo using caret into README and added more explanation in the demo
2015-09-11 11:20:37 -07:00
terrytangyuan
424bcc05fa ENH: More comments and explanation on demo using xgboost from caret 2015-09-10 23:41:36 -04:00
Yuan Tang (Terry)
62e95dcc60 DOC: Added caret_wrapper.R link into demo/README.md 2015-09-10 23:23:30 -04:00
Tong He
0fe182d3c3 Merge pull request #479 from terrytangyuan/caretwrapper
ENH/DOC: Added R package demo using caret library to train xgbTree model
2015-09-10 12:26:03 -07:00
Tianqi Chen
0c0e26effa Update README.md 2015-09-08 19:45:39 -07:00
Tianqi Chen
2a8c1c677e Merge pull request #476 from terrytangyuan/patch-1
DOC: Typo in README.md in tests folder
2015-09-08 19:38:43 -07:00
Tianqi Chen
4380641714 Merge pull request #478 from terrytangyuan/tests
TST: Added some unit tests for Python
2015-09-08 19:38:30 -07:00
terrytangyuan
9ead44531e DOC: Added new demo to index 2015-09-08 10:54:07 -04:00
terrytangyuan
d3bb466026 ENH/DOC: Added R package demo using caret library to train xgbTree model 2015-09-08 10:51:20 -04:00
terrytangyuan
8196d5d680 TST: More thorough checks for Python tests 2015-09-08 10:14:28 -04:00
terrytangyuan
82a43f448e TST: Added Python test for custom objective functions 2015-09-08 09:54:38 -04:00
terrytangyuan
eb1b185d70 TST: Added glm test for Python 2015-09-08 09:47:48 -04:00
Tong He
67f40b2629 Merge pull request #475 from terrytangyuan/master
More thorough unit testing for R package
2015-09-07 20:30:10 -07:00
terrytangyuan
33f1ab3ae1 TST: Added one minor check for xgb.importance 2015-09-07 22:51:14 -04:00
terrytangyuan
fbf2a5feed DOC: Updated CONTRIBUTORS.md 2015-09-07 22:49:10 -04:00
Yuan Tang (Terry)
cb3afeec53 DOC: Typo in README.md in tests folder 2015-09-07 22:23:47 -04:00
terrytangyuan
c50cf6d7ff TST: Added test for poisson regression 2015-09-07 22:03:28 -04:00
terrytangyuan
3a49e1bdb1 TST: Added more checks for testing custom objective 2015-09-07 21:56:50 -04:00
terrytangyuan
886955148d TST: Added test for models with custom objective 2015-09-07 21:55:17 -04:00
terrytangyuan
408c3a62a8 TST: Added test for xgb.plot.tree 2015-09-07 21:49:27 -04:00
terrytangyuan
d833038ba1 TST: Added test for xgb.importance 2015-09-07 21:48:57 -04:00
terrytangyuan
78afd6c772 TST: Added test for dump 2015-09-07 21:36:52 -04:00
Tianqi Chen
f025488294 Merge pull request #473 from evilmucedin/master
make XGBClassifier.score compatible with arrays
2015-09-06 21:18:12 -07:00
Den Raskovalov
35944a13b4 make XGBClassifier.score compatible with arrays 2015-09-06 20:41:55 -07:00
Tong He
6109a70a16 Merge pull request #471 from terrytangyuan/master
TST: Added R unit test for glm
2015-09-06 20:29:37 -07:00
Yuan Tang (Terry)
339a53d9d4 fixed unit test in R 2015-09-06 20:00:25 -04:00
Tianqi Chen
b3a3228a02 Merge pull request #469 from Far0n/patch-1
alpha & lambda for gbtree
2015-09-06 12:37:56 -07:00
terrytangyuan
92b996513e TST: Added R unit test for glm 2015-09-05 22:50:27 -04:00
Far0n
cfcb1fc491 default values for gbtree: lambda=1, alpha=0 2015-09-05 21:53:37 +02:00
Far0n
a9f884bd47 alpha = 1 as default value for gbtree 2015-09-05 21:50:53 +02:00
Far0n
dbc5c9b82d alpha & lambda for gbtree
alpha & lambda descriptions to "Parameters for Tree Booster" added (issue #466)
2015-09-05 12:36:42 +02:00
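A minimal sketch of setting the regularization terms documented by this change, assuming the parameter names lambda (L2) and alpha (L1) for the tree booster with the defaults noted in the commits above (lambda=1, alpha=0); the other values and the data file are illustrative.

```python
import xgboost as xgb

dtrain = xgb.DMatrix('agaricus.txt.train')

# lambda (L2) defaults to 1 and alpha (L1) to 0 for gbtree; set explicitly here.
params = {
    'booster': 'gbtree',
    'objective': 'binary:logistic',
    'max_depth': 3,
    'lambda': 1.0,
    'alpha': 0.0,
}

bst = xgb.train(params, dtrain, num_boost_round=10)
```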
unknown
3d6c831e8a add error for data.frame, add weight to xgboost 2015-09-02 21:43:23 -07:00
Tianqi Chen
baa3145817 Merge pull request #461 from okaoka/fix-parameter-typo
Fix a typo in parameter.md
2015-08-29 10:02:40 -07:00
okaoka
632fdc3e19 Fix a typo 2015-08-29 19:45:11 +09:00
hetong007
57a43e9da7 Merge branch 'master' of github.com:dmlc/xgboost 2015-08-27 16:06:36 -07:00
hetong007
5773d4d3c4 fix test 2015-08-27 16:02:41 -07:00
hetong007
4554da0537 add test module in R 2015-08-27 15:56:35 -07:00
Tong He
635c39c4c3 Update README.md 2015-08-27 15:35:53 -07:00
hetong007
b0be833c75 add save_period 2015-08-27 14:30:23 -07:00
yanqingmen
34f0b313af Merge pull request #4 from dmlc/master
update
2015-08-26 16:32:05 +08:00
Tianqi Chen
c4fa2f6110 Update model.md 2015-08-23 22:46:50 -07:00
Tong He
f305cdbf75 align formula 2015-08-23 22:31:00 -07:00
tqchen
6bcf35f2e1 minor 2015-08-23 22:06:38 -07:00
tqchen
3c114262aa Merge branch 'master' of ssh://github.com/dmlc/xgboost 2015-08-23 22:04:24 -07:00
Tianqi Chen
b8330fc58a Merge pull request #456 from phunterlau/master
add platform if statement in setup.py for pip for pull #450 issue
2015-08-23 22:04:19 -07:00
tqchen
483a7d05e9 Merge branch 'master' of ssh://github.com/dmlc/xgboost
Conflicts:
	doc/index.md
	doc/model.md
2015-08-23 22:03:50 -07:00
tqchen
8c4c754a72 update 2015-08-23 22:00:41 -07:00
phunterlau
f4a5a8b6cd switch back to the original version info 2015-08-23 21:28:13 -07:00
phunterlau
bc6e2af374 add back setup.py after conflict resolving 2015-08-23 21:25:38 -07:00
phunterlau
6231e153e6 Merge branch 'dmlc-master' 2015-08-23 21:22:08 -07:00
phunterlau
2dcf263536 Merge branch 'master' of git://github.com/dmlc/xgboost into dmlc-master
Conflicts:
	python-package/setup.py
2015-08-23 21:20:31 -07:00
phunterlau
f258a68029 add platform if statement in setup.py for pip for pull #450 issuecomment-133795287 2015-08-23 20:38:26 -07:00
hetong007
7294ac4fc9 refine model doc 2015-08-23 17:04:08 -07:00
hetong007
cc3c98d9b7 fix formula 2015-08-23 16:59:29 -07:00
hetong007
30c30d3696 modify model doc 2015-08-23 16:56:57 -07:00
hetong007
5196458305 add plot 2015-08-23 16:28:24 -07:00
hetong007
d5d48560a7 add model description 2015-08-23 16:25:28 -07:00
Tianqi Chen
32009942fd Merge pull request #455 from sinhrks/py3
Python Visualization Fix for python 3
2015-08-23 14:29:33 -07:00
sinhrks
00702dc39b Fix for python 3 2015-08-24 05:09:27 +09:00
Tianqi Chen
8e06726f6b Merge pull request #454 from VGuette/master
Missing parentheses in call to 'print'. Thanks for the contribution!
2015-08-23 09:16:16 -07:00
VGuette
10273a0288 Update setup.py 2015-08-23 11:01:43 +02:00
Tianqi Chen
19eef1d0da Merge pull request #450 from phunterlau/master
add necessary configurations for pip installation
2015-08-20 18:45:30 -07:00
Tong He
07182444d2 Update README.md 2015-08-20 13:53:20 -07:00
phunterlau
5e81a210ce polish README.md with more information for PR #450 2015-08-20 12:33:28 -07:00
phunterlau
db444c4a08 update with comments on PR #450, fixed styles and updated CHANGES and CONTRIBUTORS 2015-08-20 10:10:34 -07:00
phunterlau
70e230815b add necessary configurations for pip installation 2015-08-20 01:26:17 -07:00
Tianqi Chen
4af680c3b6 Merge pull request #439 from sinhrks/pyviz
Add visualization to python package! Great job
2015-08-15 09:48:49 -07:00
sinhrks
d24b36adf9 ENH: Add visualization to python package 2015-08-16 00:57:21 +09:00
Tianqi Chen
a13a3d1552 Merge pull request #443 from jdwittenauer/master
Cleaned up guide-python directory.
2015-08-13 18:07:42 -07:00
John Wittenauer
7a3676851d Cleaned up guide-python directory. 2015-08-13 20:32:47 -04:00
Tianqi Chen
a7202ee804 Merge pull request #438 from terrytangyuan/patch-1
fixed typos in basic_walkthrough demo
2015-08-10 22:45:24 -07:00
Yuan Tang
3dd40b9f37 fixed typos in basic_walkthrough demo 2015-08-10 20:35:10 -04:00
Tianqi Chen
18e1ddec3c Merge pull request #435 from terrytangyuan/typos
fixed some typos in demos comments
2015-08-09 19:51:32 -07:00
terrytangyuan
b3bffcef34 fixed some typos in demos comments 2015-08-09 22:15:02 -04:00
El Potaeto
740db8ff02 Merge remote-tracking branch 'dmlc/master' 2015-08-05 12:07:41 +02:00
Tianqi Chen
752cf4c95d Update xgboost_R.cpp 2015-08-04 22:56:16 -07:00
Tianqi Chen
b30aa96a88 Update xgboost_R.cpp 2015-08-04 20:14:58 -07:00
tqchen
0f6ad749f5 remove debug messages fix lint 2015-08-04 19:40:30 -07:00
Tianqi Chen
f42e4932fa Merge pull request #430 from EricChenDM/master
fix SetCombine and SetPrune bug
2015-08-04 19:36:42 -07:00
EricChanBD
3d38ebbef5 fix SetCombine and SetPrune bug 2015-08-05 06:19:54 +08:00
Tianqi Chen
889887c2f1 Update README.md 2015-08-03 19:37:33 -07:00
Tianqi Chen
7fe8b95833 Update README.md 2015-08-03 19:36:29 -07:00
Tianqi Chen
bd1eaa25f2 Merge pull request #424 from ajkl/patch-14
Adding dmlc stamp
2015-08-03 19:25:30 -07:00
Ajinkya Kale
81b1befd10 Adding dmlc stamp 2015-08-03 15:46:22 -07:00
muli
64dd1973b9 align logo with title 2015-08-03 12:59:28 -04:00
Tong He
bf94add992 Update faq.md 2015-08-02 19:09:33 -07:00
Tong He
f7bb8fc10f Update README.md 2015-08-02 19:04:32 -07:00
Tong He
014fa02c6a Update README.md 2015-08-02 19:03:44 -07:00
tqchen
e8de5da3a5 Document refactor
change badge
2015-08-02 19:01:38 -07:00
tqchen
c43fee541d enable basic sphinx doc 2015-08-01 11:27:13 -07:00
tqchen
8083c30e7b quick fix of solaris problem in CRAN check 2015-08-01 09:18:34 -07:00
hetong007
3a091fa302 modify desc 2015-07-31 21:33:54 +00:00
Tianqi Chen
2a01c5c865 Update CONTRIBUTORS.md 2015-07-30 22:26:10 -07:00
Tianqi Chen
362fe4e4fa Update .travis.yml 2015-07-30 22:11:27 -07:00
tqchen
60217a2c02 checkin all python 2015-07-30 22:08:48 -07:00
tqchen
c2fec29bfa python package refactor into python-package 2015-07-30 22:04:45 -07:00
Tianqi Chen
f6fed76e7e not working 2015-07-29 23:24:54 -07:00
Tianqi Chen
7560518eec sleep 2015-07-29 23:23:40 -07:00
Tianqi Chen
53107995bf give up for now 2015-07-29 22:54:21 -07:00
Tianqi Chen
264c636adf add dep 2015-07-29 22:50:23 -07:00
Tianqi Chen
f9c02aa40f final attempt 2015-07-29 22:45:28 -07:00
Tianqi Chen
11f27beccd checkin debug 2015-07-29 22:41:06 -07:00
Tianqi Chen
ebdcd94bf5 Merge pull request #418 from dmlc/travis
Travis OSX support and unfinished appveyor
2015-07-29 22:36:24 -07:00
tqchen
4a6f4eaac9 giveup for now, appveyor do not support openmp for msvc yet allow openmp to switch on 2015-07-29 22:31:35 -07:00
tqchen
ebefb78fd4 use debug 2015-07-29 22:26:21 -07:00
tqchen
73ec467dd3 final 2015-07-29 22:22:43 -07:00
tqchen
0a9c8acd6d final 2015-07-29 22:17:25 -07:00
tqchen
6f01fa50ce try disable omp 2015-07-29 22:14:38 -07:00
tqchen
67d332e0f5 ok 2015-07-29 22:01:42 -07:00
tqchen
5dab410537 ok 2015-07-29 22:00:38 -07:00
tqchen
259dea0777 incomplete appveyor 2015-07-29 21:46:41 -07:00
tqchen
e30c724bd4 ok 2015-07-29 21:39:34 -07:00
tqchen
6f4148faab ok 2015-07-29 21:37:16 -07:00
tqchen
7e16606618 ok 2015-07-29 21:36:28 -07:00
tqchen
c2c5ad2d47 finl 2015-07-29 21:35:15 -07:00
tqchen
1a91b15a6e ok 2015-07-29 21:27:40 -07:00
tqchen
bb13c2cd15 ok 2015-07-29 21:25:52 -07:00
tqchen
033a0c139e ok 2015-07-29 21:21:58 -07:00
tqchen
0d5741bc74 rest 2015-07-29 21:21:15 -07:00
tqchen
899bfbfbae rest 2015-07-29 21:19:49 -07:00
tqchen
2bf0eeb82d update appvegor 2015-07-29 21:15:25 -07:00
tqchen
c870c08b7e disable openmp in dmlc 2015-07-29 21:11:44 -07:00
tqchen
fa41fe3f13 rename 2015-07-29 21:09:42 -07:00
tqchen
8f6e5e197b ok 2015-07-29 21:07:18 -07:00
tqchen
15286523cf ok 2015-07-29 21:06:29 -07:00
tqchen
d9599f816f add appvegor 2015-07-29 21:01:53 -07:00
tqchen
6062f4dd58 update 2015-07-29 20:18:54 -07:00
tqchen
24a188588a ok 2015-07-29 20:10:29 -07:00
tqchen
2ab6907fe2 add os lrt 2015-07-29 18:45:42 -07:00
tqchen
f44511e94d fix mac build 2015-07-29 18:29:06 -07:00
tqchen
26675e6dcd Merge branch 'master' of ssh://github.com/dmlc/xgboost 2015-07-29 18:24:27 -07:00
tqchen
75c8bdf962 add osx matrix 2015-07-29 18:24:19 -07:00
Tong He
efde0eb171 enable travis on os x 2015-07-29 18:16:59 -07:00
Tong He
f4a47fa78e Merge pull request #414 from ajkl/patch-12
Fixing duplicate params in demo
2015-07-29 17:58:21 -07:00
tqchen
5f9f42292c fix sklearn best score 2015-07-29 17:49:55 -07:00
Tianqi Chen
c261b3d1f5 Merge pull request #416 from ajkl/patch-13
add setuptools info
2015-07-29 17:38:58 -07:00
Ajinkya Kale
cca955fc94 add setuptools info 2015-07-29 16:20:55 -07:00
Ajinkya Kale
0c8c231949 Fixing duplicate params in demo
Issue in "demo(package="xgboost", custom_objective)"

> bst <- xgb.train(param, dtrain, num_round, watchlist, 
+                  objective=logregobj, eval_metric=evalerror)
Error in xgb.train(param, dtrain, num_round, watchlist, objective = logregobj,  : 
  Duplicated term in parameters. Please check your list of params.
2015-07-29 14:28:34 -07:00
Tianqi Chen
d485d1849f Merge pull request #409 from ajkl/patch-11
fixing broken basic_walkthrough links
2015-07-26 21:23:12 -07:00
Ajinkya Kale
74055cc15e fixing broken basic_walkthrough links 2015-07-26 21:22:35 -07:00
Tianqi Chen
195f90159d Merge pull request #408 from ajkl/patch-10
restructuring the README with an index
2015-07-26 21:14:48 -07:00
Ajinkya Kale
fc27e2f32d adding DMLC back to the title 2015-07-26 20:31:51 -07:00
Ajinkya Kale
f2eb55683c some more links and restructuring 2015-07-26 20:30:59 -07:00
Ajinkya Kale
9a936721d8 dropping raw graphlab url 2015-07-26 20:12:51 -07:00
Tianqi Chen
eee0d5b065 Merge pull request #327 from jseabold/sklearn-eval-set
ENH: Allow early stopping through scikit-learn API
2015-07-26 11:58:45 -07:00
Tianqi Chen
b1dec917c7 Update page_fmatrix-inl.hpp 2015-07-25 21:29:46 -07:00
tqchen
0dbac3d11e fix travis 2015-07-25 21:23:40 -07:00
tqchen
f6c82d52ec make solaris happy 2015-07-25 21:17:28 -07:00
tqchen
af042f6a24 make things cxx98 compatible 2015-07-25 21:14:50 -07:00
Ajinkya Kale
cbdcbfc49c some more changes to remove redundant information 2015-07-25 12:46:28 -07:00
Ajinkya Kale
e353a2e51c restructuring the README with an index 2015-07-24 17:00:02 -07:00
hetong007
a1c7104d7f fix crash 2015-07-24 19:11:08 +00:00
unknown
198c5bb55e fix namespace and desc 2015-07-24 11:58:02 -07:00
Tianqi Chen
141f9ebf4b Update CHANGES.md 2015-07-24 08:51:05 -07:00
Michaël Benesty
f29c2f8796 Merge pull request #404 from ajkl/patch-8
moving gitter chat up
2015-07-23 15:06:55 +02:00
Michaël Benesty
5e07367979 Merge pull request #405 from ajkl/patch-9
Add license to readme
2015-07-23 10:34:45 +02:00
Ajinkya Kale
0ea5b14bd8 Update README.md 2015-07-23 01:12:33 -07:00
Ajinkya Kale
9eca9bccf4 moving gitter chat up 2015-07-22 23:18:34 -07:00
pommedeterresautee
951ba267cf move plot file 2015-07-22 23:50:54 +02:00
Michaël Benesty
1fb5c127b5 Merge pull request #399 from orenov/master
issue #368, data.table problems
2015-07-22 21:21:34 +02:00
Michaël Benesty
4a71b0ec19 Merge pull request #402 from wgstanton/patch-2
Fixed a few typos in README
2015-07-22 18:44:30 +02:00
Will Stanton
ba63b2886f Check out vs. checkout
Made it consistent across the README
2015-07-22 10:37:49 -06:00
Will Stanton
d120167725 Fixed a few typos in README 2015-07-22 09:19:22 -06:00
El Potaeto
031b34b121 Merge remote-tracking branch 'dmlc/master' 2015-07-22 13:30:38 +02:00
orenov
d8fc16538e issue #368, data.table problems 2015-07-22 12:03:01 +03:00
Tianqi Chen
80b6ec4478 update more contributor names 2015-07-21 21:31:39 -07:00
Tianqi Chen
9203d26a2f Update CONTRIBUTORS.md 2015-07-21 08:13:07 -07:00
Tianqi Chen
4cf116ceb6 Update CONTRIBUTORS.md 2015-07-20 22:58:10 -07:00
Tianqi Chen
41f30c288e Update CONTRIBUTORS.md 2015-07-20 22:56:29 -07:00
tqchen
b18c7f9466 ok 2015-07-20 22:50:59 -07:00
tqchen
d18492e751 add list of contributors 2015-07-20 22:48:45 -07:00
El Potaeto
86f9f707d8 Merge remote-tracking branch 'dmlc/master' 2015-07-15 16:00:21 +02:00
El Potaeto
0dfc443252 New projection of all trees on one 2015-07-15 15:59:36 +02:00
Tianqi Chen
71cd9b9000 Merge pull request #393 from jpata/wrapper-dict-fix
fix wrapper dict issue #392 thanks! merged
2015-07-14 08:53:37 -07:00
Joosep
be95c80aa2 fix wrapper dict 2015-07-14 11:38:38 +02:00
Tianqi Chen
b7f355fdd2 Update travis_after_failure.sh 2015-07-12 11:00:52 -07:00
Tianqi Chen
4a746be43a Update build.md 2015-07-12 10:36:16 -07:00
Tianqi Chen
44f839b896 Update README.md 2015-07-12 10:31:55 -07:00
Tianqi Chen
35638f6146 Update README.md 2015-07-12 10:27:58 -07:00
Tianqi Chen
e402d20876 Update README.md 2015-07-10 20:41:20 -07:00
Tianqi Chen
dabb36c006 Update README.md 2015-07-10 20:41:00 -07:00
Skipper Seabold
b76db01c66 STY: Fix lint errors 2015-07-08 14:29:52 -05:00
Skipper Seabold
4a37b852a0 DOC: Add early stopping example 2015-07-08 13:55:47 -05:00
Skipper Seabold
b0f7ddaa2e REF: Combine eval_metric and feval to one parameter 2015-07-08 13:55:47 -05:00
Skipper Seabold
113285e1dc DOC: Point to parameter.md for eval_metric 2015-07-08 13:55:47 -05:00
Skipper Seabold
46e9520a28 DOC: Document verbose_eval 2015-07-08 13:55:47 -05:00
Skipper Seabold
cf89ae64e2 ENH: Allow for silent evaluation 2015-07-08 13:55:47 -05:00
Skipper Seabold
3952b525b8 ENH: Allow possibly negative evaluation metrics. 2015-07-08 11:10:36 -05:00
Skipper Seabold
0f5f9c0385 ENH: Allow early stopping in sklearn API. 2015-07-08 11:10:36 -05:00
Tianqi Chen
167544d792 Merge pull request #382 from ajkl/patch-6
refs and formatting changes
2015-07-07 19:32:52 -07:00
Tianqi Chen
1fee7da16f Merge pull request #384 from ajkl/patch-7
need to load vcd if it was freshly installed
2015-07-07 19:32:28 -07:00
Tianqi Chen
048d6929f4 Merge pull request #375 from yanqingmen/java_wrapper
good job! merged
2015-07-07 19:31:54 -07:00
Ajinkya Kale
57e4f4d426 need to load vcd if it was freshly installed 2015-07-07 17:36:18 -07:00
yanqingmen
969ea57159 Update travis_java_script.sh
add "set -e"
2015-07-07 17:28:45 -07:00
Ajinkya Kale
c489ce62b2 refs and formatting changes 2015-07-07 16:36:45 -07:00
yanqingmen
fc75885e9e add travis-ci script for java wrapper 2015-07-07 19:22:51 +08:00
Tianqi Chen
28f8267563 Update README.md 2015-07-06 22:45:27 -07:00
Tianqi Chen
9ec4c43dd2 Update README.md 2015-07-06 22:44:59 -07:00
Tianqi Chen
46342d4633 checkin 2015-07-06 20:07:04 -07:00
Tianqi Chen
fd26f45208 Merge pull request #377 from ajkl/patch-3
Adding some details on nthread parameter
2015-07-06 19:58:44 -07:00
Tianqi Chen
13aff0d8cd Merge pull request #378 from ajkl/patch-4
Adding workaround for install the R-package
2015-07-06 19:55:25 -07:00
Tianqi Chen
af76bbb3f3 Merge pull request #379 from ajkl/patch-5
Adding examples on xgb.importance, xgb.plot.importance and xgb.plot tree
2015-07-06 19:55:06 -07:00
yanqingmen
0fc47f5abb add testcases 2015-07-06 18:50:46 -07:00
yanqingmen
4d382a8cc1 rename xgboosterror 2015-07-06 17:55:13 -07:00
Ajinkya Kale
364abdd6d1 Adding examples on xgb.importance, xgb.plot.importance and xgb.plot tree 2015-07-06 16:45:30 -07:00
Ajinkya Kale
761ab7c834 Adding workaround for install the R-package
I was facing this issue and this workaround worked for me. Maybe this should be moved to know issues section.
2015-07-06 14:52:38 -07:00
Ajinkya Kale
b1bcb7183b Adding some details on nthread parameter
I got this information about nthread='real cpu count' from 7cb449c4a7/java/xgboost4j-demo/src/main/java/org/dmlc/xgboost4j/demo/ExternalMemory.java (L50)
Please confirm if this note is still valid before merging this change!
2015-07-06 11:02:19 -07:00
yanqingmen
e99ab0d1dd minor fix 2015-07-06 20:56:17 +08:00
yanqingmen
f73bcd427d update java wrapper for new fault handle API 2015-07-06 02:32:58 -07:00
yanqingmen
7755c00721 Merge pull request #2 from dmlc/master
pr from origin:master
2015-07-06 09:00:42 +08:00
tqchen
a735f8cb76 quick patch threadlocal 2015-07-04 18:29:42 -07:00
tqchen
cc767add88 API refactor to make fault handling easy 2015-07-04 18:12:44 -07:00
Tianqi Chen
4d436a3cb0 Update README.md 2015-07-03 21:59:40 -07:00
Tianqi Chen
53a18635ee Merge pull request #371 from ajkl/patch-2
fixing some typos
2015-07-03 21:42:54 -07:00
tqchen
f0421e9455 last check 2015-07-03 21:27:29 -07:00
tqchen
93319841ed ok 2015-07-03 21:20:56 -07:00
tqchen
ccf21ec061 add scipy dep 2015-07-03 21:15:10 -07:00
tqchen
39913d6ee8 add scipy dep 2015-07-03 21:14:49 -07:00
tqchen
fe3464b763 update script 2015-07-03 21:11:01 -07:00
tqchen
af0a451dc4 refactor and ci 2015-07-03 21:08:36 -07:00
tqchen
59b91cf205 make python lint 2015-07-03 20:36:41 -07:00
tqchen
57ec922214 fix all cpp lint 2015-07-03 19:42:44 -07:00
tqchen
1123253f79 lint all 2015-07-03 19:35:23 -07:00
tqchen
aba41d07cd lint learner finish 2015-07-03 19:20:45 -07:00
tqchen
1581de08da fix all utils 2015-07-03 18:44:01 -07:00
tqchen
0162bb7034 lint half way 2015-07-03 18:31:52 -07:00
Ajinkya Kale
c70a73f38d fixing some typos 2015-07-01 22:35:41 -07:00
Tong He
2ed40523ab Merge pull request #369 from ajkl/patch-1
Some typo and formatting fixes
2015-07-01 13:05:31 -07:00
Ajinkya Kale
009f692f49 Some typo and formatting fixes 2015-07-01 12:12:47 -07:00
Tong He
48e19c1964 Update xgb.cv.R 2015-06-22 12:42:12 -07:00
Tong He
704d9e0a13 fix early stopping and prediction 2015-06-21 19:46:31 -07:00
Tong He
6b254ec495 Update Makefile 2015-06-21 19:25:09 -07:00
tqchen
561e51871e ok 2015-06-17 21:00:34 -07:00
Tong He
777c5ce992 temporarily do not compile vignette 2015-06-16 15:08:01 -07:00
Tong He
70c5c12067 update knitr dependency 2015-06-16 14:39:04 -07:00
Tong He
1595d36721 ask travis to compile vignette 2015-06-16 14:22:51 -07:00
pommedeterresautee
37714eb331 Merge branch 'master' of https://github.com/pommedeterresautee/xgboost 2015-06-16 21:40:09 +02:00
pommedeterresautee
ad2e93f6c5 multi tree update 2015-06-16 21:39:31 +02:00
pommedeterresautee
936190c17c slight update in documentation 2015-06-16 21:38:14 +02:00
hetong007
9987fb24f8 update makefile 2015-06-16 11:43:04 -07:00
hetong007
67f0b69a4c change makefile to be compatible with r-travis 2015-06-16 11:30:11 -07:00
Tong He
5568f83a6c Update .travis.yml 2015-06-15 22:40:15 -07:00
Tong He
b08c3c5baa Update .travis.yml 2015-06-15 22:16:11 -07:00
Tong He
7d9ac3f97d Update .travis.yml 2015-06-15 19:15:34 -07:00
hetong007
0bbb4a07b2 add travis conf, waiting for setting on travis-ci.org 2015-06-15 15:25:40 -07:00
tqchen
7a92d4008e fix col from dense 2015-06-15 09:24:10 -07:00
hetong007
c51d71b033 check duplicated params 2015-06-12 16:48:01 -07:00
Tong He
7cb449c4a7 Update xgb.cv.R 2015-06-11 14:16:20 -07:00
Tong He
61142f203b check whether objective is character 2015-06-11 14:04:43 -07:00
Tianqi Chen
fbaa3821a4 Merge pull request #351 from yanqingmen/java_wrapper
Java wrapper for xgboost
2015-06-11 09:02:32 -07:00
yanqingmen
4e8a1c6516 rm WatchList class, take Iterable<Entry<String, DMatrix>> as eval param, change Params to Iterable<Entry<String, Object>> 2015-06-10 23:34:52 -07:00
yanqingmen
8c5d3ac130 Merge branch 'java_wrapper' of https://github.com/yanqingmen/xgboost into java_wrapper 2015-06-10 20:11:11 -07:00
yanqingmen
c110111f52 make some fix 2015-06-10 20:09:49 -07:00
yanqingmen
1e03be4e08 Update Makefile 2015-06-09 23:30:00 -07:00
yanqingmen
f91a098770 add java wrapper 2015-06-09 23:14:50 -07:00
yanqingmen
fcca359774 Merge pull request #1 from dmlc/master
pull from dmlc
2015-06-10 09:09:42 +08:00
Tianqi Chen
00a8076deb Merge pull request #350 from jeremyatia/patch-1
Update understandingXGBoostModel.Rmd
2015-06-08 16:36:40 -07:00
Jeremy ATIA
a6abdccf01 Update understandingXGBoostModel.Rmd
a typo for the dimension of the test set
2015-06-08 23:31:12 +02:00
El Potaeto
ab219d3331 Merge remote-tracking branch 'dmlc/master' 2015-06-03 11:18:45 +02:00
tqchen
2937f5eebc io part refactor 2015-06-02 23:18:31 -07:00
tqchen
e5dd894960 add a indicator opt 2015-06-02 11:38:06 -07:00
Tong He
bc7f6b37b0 Update README.md 2015-05-30 17:39:19 -07:00
hetong007
36031d9a36 modify script to use objective and eval_metric 2015-05-30 15:48:57 -07:00
Tong He
27e4cbb215 Merge pull request #337 from jonrobinson2/patch-1
Update xgboostPresentation.Rmd
2015-05-28 09:32:32 -07:00
Tong He
f9ae83e951 Update xgb.cv.R 2015-05-28 09:30:23 -07:00
Jonathan Robinson
a55f4d3416 Update xgboostPresentation.Rmd
Edited to note unavailability of stable version of this package on CRAN.

http://cran.r-project.org/web/packages/xgboost/index.html
2015-05-28 09:45:46 -04:00
hetong007
733d23aef8 rename arguments to be dot-seperated 2015-05-25 11:51:01 -07:00
hetong007
8d3a7e1688 change doc and demo for new obj feval interface 2015-05-25 11:30:04 -07:00
hetong007
19b24cf978 customized obj and feval interface 2015-05-25 11:19:38 -07:00
Tong He
458585b5fd Update xgb.train.R 2015-05-25 10:24:59 -07:00
Tianqi Chen
1d57cfb7bd Update xgboost.py 2015-05-22 13:27:08 -07:00
Tianqi Chen
bc7241b2a4 Update README.md 2015-05-21 13:44:21 -07:00
Tianqi Chen
7d132aefa9 Update LICENSE 2015-05-21 13:01:15 -07:00
Tianqi Chen
a31aaa410c Update parameter.md 2015-05-20 17:27:15 -07:00
Tianqi Chen
da5e62773d Merge pull request #328 from drsaltiel/patch-1
Update parameter.md to include parameter ranges
2015-05-20 17:26:00 -07:00
Daniel Saltiel
b1c79323af Update parameter.md to include parameter ranges
only updated for tree booster parameters
2015-05-20 17:13:20 -07:00
Tianqi Chen
c82101ef16 Merge pull request #324 from jseabold/allow-zero-as-missing
ENH: Allow missing = 0
2015-05-18 18:54:17 +02:00
Skipper Seabold
978216d350 ENH: Allow missing = 0 2015-05-18 11:43:58 -05:00
Tianqi Chen
0c6bfa74b5 Merge pull request #315 from jseabold/sklearn-handle-missing
ENH: Allow settable missing value in sklearn api.
2015-05-18 17:00:53 +02:00
Tianqi Chen
01175a415a Merge pull request #323 from jseabold/fix-errors
BUG: XGBError -> XGBoostError
2015-05-18 17:00:08 +02:00
Skipper Seabold
a17cb2339e BUG: XGBError -> XGBoostError 2015-05-18 09:09:22 -05:00
Skipper Seabold
0a0a80ec72 ENH: Allow settable missing value in sklearn api. 2015-05-18 09:06:09 -05:00
tqchen
91a5390929 checkin copy 2015-05-17 21:29:51 -07:00
pommedeterresautee
1ea7f6f033 fix bug 2015-05-17 20:37:15 +02:00
pommedeterresautee
947afd7eac multi trees 2015-05-17 15:16:28 +02:00
tqchen
e6b8b23a2c allow booster to be pickable, add copy function 2015-05-16 12:59:55 -07:00
tqchen
39f1da08d2 Merge branch 'master' of ssh://github.com/dmlc/xgboost 2015-05-15 23:54:40 -07:00
tqchen
09a841f810 auto turn on optimization 2015-05-15 23:54:34 -07:00
tqchen
792cff5abc checkin some micro optimization 2015-05-15 23:54:03 -07:00
Tianqi Chen
f49525ee95 Merge pull request #319 from jdwittenauer/master
Add classes_ attribute to scikit-learn wrapper
2015-05-15 22:03:18 -07:00
John Wittenauer
4e080928a8 Added classes_ attribute to scikit-learn wrapper. 2015-05-15 21:19:39 -04:00
Tianqi Chen
9c52fc8e22 Merge pull request #314 from enizhibitsky/wrapper_stopping_fix
Fix early stopping in python wrapper
2015-05-14 16:16:47 -07:00
Tianqi Chen
019ab50994 Merge pull request #313 from alexchao56/master
Updated grammar for the README.md
2015-05-14 16:16:13 -07:00
Eugene Nizhibitsky
b63868327f Fix early stopping in python wrapper 2015-05-14 22:55:49 +03:00
Alex Chao
e080c663a8 Updated grammar for the README.md 2015-05-14 11:57:50 -07:00
tqchen
3a7808dc7d remove print 2015-05-13 23:34:09 -07:00
Tianqi Chen
49ad633530 Update xgboost.py 2015-05-13 23:15:19 -07:00
Tong He
e03ef41829 Merge pull request #312 from by321/master
xgb.cv( printEveryN ) parameter to print every n-th progress message
2015-05-13 22:18:47 -07:00
by321
a4341f22a2 xgb.csv(printEveryN) parameter to print every n-th progress message 2015-05-13 21:51:05 -07:00
tqchen
b8b0243d95 Merge branch 'master' of ssh://github.com/dmlc/xgboost 2015-05-12 20:21:00 -07:00
tqchen
62801f5343 allow fpic 2015-05-12 20:20:30 -07:00
289 changed files with 14644 additions and 4347 deletions

21
.gitignore vendored

@@ -48,13 +48,26 @@ Debug
*.cpage.col
*.cpage
*.Rproj
xgboost
xgboost.mpi
xgboost.mock
train*
./xgboost
./xgboost.mpi
./xgboost.mock
rabit
#.Rbuildignore
R-package.Rproj
*.cache*
R-package/inst
R-package/src
#java
java/xgboost4j/target
java/xgboost4j/tmp
java/xgboost4j-demo/target
java/xgboost4j-demo/data/
java/xgboost4j-demo/tmp/
java/xgboost4j-demo/model/
nb-configuration*
dmlc-core
# Eclipse
.project
.cproject
.pydevproject
.settings/

58
.travis.yml Normal file

@@ -0,0 +1,58 @@
sudo: true
# Enabling test on Linux and OS X
os:
- linux
- osx
# Use Build Matrix to do lint and build separately
env:
matrix:
- TASK=lint LINT_LANG=cpp
- TASK=lint LINT_LANG=python
- TASK=R-package CXX=g++
- TASK=python-package CXX=g++
- TASK=python-package3 CXX=g++
- TASK=java-package CXX=g++
- TASK=build CXX=g++
- TASK=build-with-dmlc CXX=g++
os:
- linux
- osx
# dependent apt packages
addons:
apt:
packages:
- doxygen
- libopenmpi-dev
- wget
- libcurl4-openssl-dev
- unzip
- python-numpy
- python-scipy
before_install:
- scripts/travis_osx_install.sh
- git clone https://github.com/dmlc/dmlc-core
- export TRAVIS=dmlc-core/scripts/travis/
- export PYTHONPATH=${PYTHONPATH}:${PWD}/python-package
- source ${TRAVIS}/travis_setup_env.sh
install:
- pip install cpplint pylint --user `whoami`
script: scripts/travis_script.sh
after_failure:
- scripts/travis_after_failure.sh
notifications:
email:
on_success: change
on_failure: always

CHANGES.md

@@ -1,18 +1,18 @@
Change Log
=====
==========
xgboost-0.1
=====
-----------
* Initial release
xgboost-0.2x
=====
------------
* Python module
* Weighted samples instances
* Initial version of pairwise rank
xgboost-0.3
=====
-----------
* Faster tree construction module
- Allows subsample columns during tree construction via ```bst:col_samplebytree=ratio```
* Support for boosting from initial predictions
@@ -22,7 +22,7 @@ xgboost-0.3
* Add R module
xgboost-0.4
=====
-----------
* Distributed version of xgboost that runs on YARN, scales to billions of examples
* Direct save/load data and model from/to S3 and HDFS
* Feature importance visualization in R module, by Michael Benesty
@@ -34,3 +34,28 @@ xgboost-0.4
- xgboost python model is now pickable
* sklearn wrapper is supported in python module
* Experimental External memory version
xgboost-0.47
------------
* Changes in R library
- fixed possible problem of poisson regression.
- switched from 0 to NA for missing values.
- exposed access to additional model parameters.
* Changes in Python library
- throws exception instead of crash terminal when a parameter error happens.
- has importance plot and tree plot functions.
- accepts different learning rates for each boosting round.
- allows model training continuation from previously saved model.
- allows early stopping in CV.
- allows feval to return a list of tuples.
- allows eval_metric to handle additional format.
- improved compatibility in sklearn module.
- additional parameters added for sklearn wrapper.
- added pip installation functionality.
- supports more Pandas DataFrame dtypes.
- added best_ntree_limit attribute, in addition to best_score and best_iteration.
* Java api is ready for use
* Added more test cases and continuous integration to make each build more robust.
on going at master
------------------
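As a quick, hypothetical illustration of two of the changes listed above (NA rather than 0 as the default missing-value marker in the R package, and early stopping during cross-validation, which the R xgb.cv diff further down in this compare also adds), here is a minimal R sketch. It assumes the agaricus demo data shipped with the R package and is not taken from the release notes themselves.

    library(xgboost)
    data(agaricus.train, package = 'xgboost')
    # NA entries are now treated as missing by default when building the DMatrix
    dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
    # stop when the test metric has not improved for 3 consecutive rounds
    cv.res <- xgb.cv(params = list(objective = "binary:logistic", max.depth = 2, eta = 1),
                     data = dtrain, nrounds = 50, nfold = 5,
                     early.stop.round = 3, maximize = FALSE)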

61
CONTRIBUTORS.md Normal file

@@ -0,0 +1,61 @@
Contributors of DMLC/XGBoost
============================
XGBoost has been developed and used by an active community. Everyone is more than welcome to contribute; it is a great way to make the project better and more accessible to more users.
Committers
----------
Committers are people who have made substantial contributions to the project and have been granted write access to it.
* [Tianqi Chen](https://github.com/tqchen), University of Washington
- Tianqi is a PhD student working on large-scale machine learning; he is the creator of the project.
* [Tong He](https://github.com/hetong007), Simon Fraser University
- Tong is a master's student working on data mining; he is the maintainer of the xgboost R package.
* [Bing Xu](https://github.com/antinucleon)
- Bing is the original creator of xgboost python package and currently the maintainer of [XGBoost.jl](https://github.com/antinucleon/XGBoost.jl).
* [Michael Benesty](https://github.com/pommedeterresautee)
- Michael is a lawyer and data scientist in France; he is the creator of the xgboost interactive analysis module in R.
* [Yuan Tang](https://github.com/terrytangyuan)
- Yuan is a data scientist in Chicago, US. He contributed mostly to the R and Python packages.
Become a Committer
-----------------
XGBoost is an open-source project and we are actively looking for new committers who are willing to help maintain and lead the project.
Committers come from contributors who:
* Made substantial contributions to the project.
* Are willing to spend time maintaining and leading the project.
New committers will be proposed by current committer members, with support from more than two current committers.
List of Contributors
--------------------
* [Full List of Contributors](https://github.com/dmlc/xgboost/graphs/contributors)
- To contributors: please add your name to the list when you submit a patch to the project:)
* [Kailong Chen](https://github.com/kalenhaha)
- Kailong is an early contributor to xgboost; he is the creator of the ranking objectives in xgboost.
* [Skipper Seabold](https://github.com/jseabold)
- Skipper is the major contributor to the scikit-learn module of xgboost.
* [Zygmunt Zając](https://github.com/zygmuntz)
- Zygmunt is the master behind the early stopping feature frequently used by kagglers.
* [Ajinkya Kale](https://github.com/ajkl)
* [Boliang Chen](https://github.com/cblsjtu)
* [Vadim Khotilovich](https://github.com/khotilov)
* [Yangqing Men](https://github.com/yanqingmen)
- Yangqing is the creator of xgboost java package.
* [Engpeng Yao](https://github.com/yepyao)
* [Giulio](https://github.com/giuliohome)
- Giulio is the creator of the Windows project of xgboost.
* [Jamie Hall](https://github.com/nerdcha)
- Jamie is the initial creator of the xgboost sklearn module.
* [Yen-Ying Lee](https://github.com/white1033)
* [Masaaki Horikoshi](https://github.com/sinhrks)
- Masaaki is the initial creator of xgboost python plotting module.
* [Hongliang Liu](https://github.com/phunterlau)
- Hongliang is the maintainer of xgboost python PyPI package for pip installation.
* [daiyl0320](https://github.com/daiyl0320)
- daiyl0320 contributed patches that make the distributed version of xgboost more robust and scale stably on TB-scale datasets.
* [Huayi Zhang](https://github.com/irachex)
* [Johan Manders](https://github.com/johanmanders)
* [yoori](https://github.com/yoori)
* [Mathias Müller](https://github.com/far0n)
* [Sam Thomson](https://github.com/sammthomson)
* [ganesh-krishnan](https://github.com/ganesh-krishnan)
* [Damien Carol](https://github.com/damiencarol)

LICENSE

@@ -1,4 +1,4 @@
Copyright (c) 2014 by Tianqi Chen and Contributors
Copyright (c) 2014 by Contributors
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.

101
Makefile

@@ -1,24 +1,45 @@
export CC = gcc
export CXX = g++
export CC = $(if $(shell which gcc-5 2>/dev/null),gcc-5,gcc)
export CXX = $(if $(shell which g++-5 2>/dev/null),g++-5,g++)
export MPICXX = mpicxx
export LDFLAGS= -pthread -lm
export CFLAGS = -Wall -O3 -msse2 -Wno-unknown-pragmas -fPIC
export CFLAGS = -Wall -O3 -msse2 -Wno-unknown-pragmas -funroll-loops
# java include path
export JAVAINCFLAGS = -I${JAVA_HOME}/include -I./java
ifeq ($(OS), Windows_NT)
export CXX = g++ -m64
export CC = gcc -m64
endif
UNAME= $(shell uname)
ifeq ($(UNAME), Linux)
LDFLAGS += -lrt
JAVAINCFLAGS += -I${JAVA_HOME}/include/linux
endif
ifeq ($(UNAME), Darwin)
JAVAINCFLAGS += -I${JAVA_HOME}/include/darwin
endif
ifeq ($(no_omp),1)
CFLAGS += -DDISABLE_OPENMP
else
CFLAGS += -fopenmp
#CFLAGS += -fopenmp
ifeq ($(omp_mac_static),1)
#CFLAGS += -fopenmp -Bstatic
CFLAGS += -static-libgcc -static-libstdc++ -L. -fopenmp
#LDFLAGS += -Wl,--whole-archive -lpthread -Wl --no-whole-archive
else
CFLAGS += -fopenmp
endif
endif
# by default use c++11
ifeq ($(cxx11),1)
CFLAGS += -std=c++11
else
endif
# handling dmlc
@@ -38,6 +59,14 @@ else
LIBDMLC=dmlc_simple.o
endif
ifndef WITH_FPIC
WITH_FPIC = 1
endif
ifeq ($(WITH_FPIC), 1)
CFLAGS += -fPIC
endif
ifeq ($(OS), Windows_NT)
LIBRABIT = subtree/rabit/lib/librabit_empty.a
SLIB = wrapper/xgboost_wrapper.dll
@@ -46,16 +75,27 @@ else
SLIB = wrapper/libxgboostwrapper.so
endif
# java lib
JLIB = java/libxgboost4j.so
# specify tensor path
BIN = xgboost
MOCKBIN = xgboost.mock
OBJ = updater.o gbm.o io.o main.o dmlc_simple.o
MPIBIN =
TARGET = $(BIN) $(OBJ) $(SLIB)
ifeq ($(WITH_FPIC), 1)
TARGET = $(BIN) $(OBJ) $(SLIB)
else
TARGET = $(BIN)
endif
.PHONY: clean all mpi python Rpack
ifndef LINT_LANG
LINT_LANG= "all"
endif
all: $(BIN) $(OBJ) $(SLIB)
.PHONY: clean all mpi python Rpack lint
all: $(TARGET)
mpi: $(MPIBIN)
python: wrapper/libxgboostwrapper.so
@@ -68,6 +108,9 @@ main.o: src/xgboost_main.cpp src/utils/*.h src/*.h src/learner/*.hpp src/learner
xgboost: updater.o gbm.o io.o main.o $(LIBRABIT) $(LIBDMLC)
wrapper/xgboost_wrapper.dll wrapper/libxgboostwrapper.so: wrapper/xgboost_wrapper.cpp src/utils/*.h src/*.h src/learner/*.hpp src/learner/*.h updater.o gbm.o io.o $(LIBRABIT) $(LIBDMLC)
java: java/libxgboost4j.so
java/libxgboost4j.so: java/xgboost4j_wrapper.cpp wrapper/xgboost_wrapper.cpp src/utils/*.h src/*.h src/learner/*.hpp src/learner/*.h updater.o gbm.o io.o $(LIBRABIT) $(LIBDMLC)
# dependency on rabit
subtree/rabit/lib/librabit.a: subtree/rabit/src/engine.cc
+ cd subtree/rabit;make lib/librabit.a; cd ../..
@@ -79,7 +122,7 @@ subtree/rabit/lib/librabit_mpi.a: subtree/rabit/src/engine_mpi.cc
+ cd subtree/rabit;make lib/librabit_mpi.a; cd ../..
$(BIN) :
$(CXX) $(CFLAGS) -o $@ $(filter %.cpp %.o %.c %.cc %.a, $^) $(LDFLAGS)
$(CXX) $(CFLAGS) -fPIC -o $@ $(filter %.cpp %.o %.c %.cc %.a, $^) $(LDFLAGS)
$(MOCKBIN) :
$(CXX) $(CFLAGS) -o $@ $(filter %.cpp %.o %.c %.cc %.a, $^) $(LDFLAGS)
@@ -87,6 +130,9 @@ $(MOCKBIN) :
$(SLIB) :
$(CXX) $(CFLAGS) -fPIC -shared -o $@ $(filter %.cpp %.o %.c %.a %.cc, $^) $(LDFLAGS) $(DLLFLAGS)
$(JLIB) :
$(CXX) $(CFLAGS) -fPIC -shared -o $@ $(filter %.cpp %.o %.c %.a %.cc, $^) $(LDFLAGS) $(JAVAINCFLAGS)
$(OBJ) :
$(CXX) -c $(CFLAGS) -o $@ $(firstword $(filter %.cpp %.c %.cc, $^) )
@@ -122,10 +168,47 @@ Rpack:
cat R-package/src/Makevars|sed '2s/.*/PKGROOT=./' > xgboost/src/Makevars
cp xgboost/src/Makevars xgboost/src/Makevars.win
# R CMD build --no-build-vignettes xgboost
# R CMD build xgboost
# rm -rf xgboost
# R CMD check --as-cran xgboost*.tar.gz
Rbuild:
make Rpack
R CMD build xgboost
rm -rf xgboost
Rcheck:
make Rbuild
R CMD check --as-cran xgboost*.tar.gz
pythonpack:
#for pip maintainer only
cd subtree/rabit;make clean;cd ..
rm -rf xgboost-deploy xgboost*.tar.gz
cp -r python-package xgboost-deploy
#cp *.md xgboost-deploy/
cp LICENSE xgboost-deploy/
cp Makefile xgboost-deploy/xgboost
cp -r wrapper xgboost-deploy/xgboost
cp -r subtree xgboost-deploy/xgboost
cp -r multi-node xgboost-deploy/xgboost
cp -r windows xgboost-deploy/xgboost
cp -r src xgboost-deploy/xgboost
cp python-package/setup_pip.py xgboost-deploy/setup.py
#make python
pythonbuild:
make pythonpack
python setup.py install
pythoncheck:
make pythonbuild
python -c 'import xgboost;print xgboost.core.find_lib_path()'
# lint requires dmlc to be in current folder
lint:
dmlc-core/scripts/lint.py xgboost $(LINT_LANG) src wrapper R-package python-package
clean:
$(RM) -rf $(OBJ) $(BIN) $(MPIBIN) $(MPIOBJ) $(SLIB) *.o */*.o */*/*.o *~ */*~ */*/*~
cd subtree/rabit; make clean; cd ..


@@ -3,3 +3,4 @@
\.dll$
^.*\.Rproj$
^\.Rproj\.user$
README.md

R-package/DESCRIPTION

@@ -1,18 +1,18 @@
Package: xgboost
Type: Package
Title: eXtreme Gradient Boosting
Version: 0.4-0
Date: 2015-05-11
Author: Tianqi Chen <tianqi.tchen@gmail.com>, Tong He <hetong007@gmail.com>, Michael Benesty <michael@benesty.fr>
Title: Extreme Gradient Boosting
Version: 0.4-2
Date: 2015-08-01
Author: Tianqi Chen <tianqi.tchen@gmail.com>, Tong He <hetong007@gmail.com>,
Michael Benesty <michael@benesty.fr>
Maintainer: Tong He <hetong007@gmail.com>
Description: Xgboost is short for eXtreme Gradient Boosting, which is an
efficient and scalable implementation of gradient boosting framework.
This package is an R wrapper of xgboost. The package includes efficient
linear model solver and tree learning algorithms. The package can automatically
do parallel computation with OpenMP, and it can be more than 10 times faster
than existing gradient boosting packages such as gbm. It supports various
objective functions, including regression, classification and ranking. The
package is made to be extensible, so that users are also allowed to define
Description: Extreme Gradient Boosting, which is an efficient implementation
of gradient boosting framework. This package is its R interface. The package
includes efficient linear model solver and tree learning algorithms. The package
can automatically do parallel computation on a single machine which could be
more than 10 times faster than existing gradient boosting packages. It supports
various objective functions, including regression, classification and ranking.
The package is made to be extensible, so that users are also allowed to define
their own objectives easily.
License: Apache License (== 2.0) | file LICENSE
URL: https://github.com/dmlc/xgboost
@@ -20,15 +20,18 @@ BugReports: https://github.com/dmlc/xgboost/issues
VignetteBuilder: knitr
Suggests:
knitr,
ggplot2 (>= 1.0.0),
DiagrammeR (>= 0.6),
ggplot2 (>= 1.0.1),
DiagrammeR (>= 0.8.1),
Ckmeans.1d.dp (>= 3.3.1),
vcd (>= 1.3)
vcd (>= 1.3),
testthat,
igraph (>= 1.0.1)
Depends:
R (>= 2.10)
Imports:
Matrix (>= 1.1-0),
methods,
data.table (>= 1.9.4),
data.table (>= 1.9.6),
magrittr (>= 1.5),
stringr (>= 0.6.2)
RoxygenNote: 5.0.1

R-package/NAMESPACE

@@ -1,16 +1,19 @@
# Generated by roxygen2 (4.1.1): do not edit by hand
# Generated by roxygen2: do not edit by hand
export(getinfo)
export(setinfo)
export(slice)
export(xgb.DMatrix)
export(xgb.DMatrix.save)
export(xgb.create.features)
export(xgb.cv)
export(xgb.dump)
export(xgb.importance)
export(xgb.load)
export(xgb.model.dt.tree)
export(xgb.plot.deepness)
export(xgb.plot.importance)
export(xgb.plot.multi.trees)
export(xgb.plot.tree)
export(xgb.save)
export(xgb.save.raw)
@@ -23,6 +26,7 @@ importClassesFrom(Matrix,dgCMatrix)
importClassesFrom(Matrix,dgeMatrix)
importFrom(Matrix,cBind)
importFrom(Matrix,colSums)
importFrom(Matrix,sparse.model.matrix)
importFrom(Matrix,sparseVector)
importFrom(data.table,":=")
importFrom(data.table,as.data.table)
@@ -35,6 +39,7 @@ importFrom(data.table,setnames)
importFrom(magrittr,"%>%")
importFrom(magrittr,add)
importFrom(magrittr,not)
importFrom(stringr,str_detect)
importFrom(stringr,str_extract)
importFrom(stringr,str_extract_all)
importFrom(stringr,str_match)


@@ -23,7 +23,6 @@ setClass('xgb.DMatrix')
#' stopifnot(all(labels2 == 1-labels))
#' @rdname getinfo
#' @export
#'
getinfo <- function(object, ...){
UseMethod("getinfo")
}
@@ -54,4 +53,3 @@ setMethod("getinfo", signature = "xgb.DMatrix",
}
return(ret)
})


@@ -20,6 +20,17 @@ setClass("xgb.Booster",
#' only valid for gbtree, but not for gblinear. Set it to a value bigger
#' than 0. It will use all trees by default.
#' @param predleaf whether to predict the leaf index instead. If set to TRUE, the output will be a matrix object.
#'
#' @details
#' The purpose of the option \code{ntreelimit} is to let the user train a model with lots
#' of trees but use only the first trees for prediction to avoid overfitting
#' (without having to train a new model with fewer trees).
#'
#' The option \code{predleaf} is inspired by §3.1 of the paper
#' \code{Practical Lessons from Predicting Clicks on Ads at Facebook}.
#' The idea is to use the model as a generator of new features which capture non-linear links
#' from the original features.
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' data(agaricus.test, package='xgboost')
@@ -29,9 +40,8 @@ setClass("xgb.Booster",
#' eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
#' pred <- predict(bst, test$data)
#' @export
#'
setMethod("predict", signature = "xgb.Booster",
definition = function(object, newdata, missing = NULL,
definition = function(object, newdata, missing = NA,
outputmargin = FALSE, ntreelimit = NULL, predleaf = FALSE) {
if (class(object) != "xgb.Booster"){
stop("predict: model in prediction must be of class xgb.Booster")
@@ -39,11 +49,7 @@ setMethod("predict", signature = "xgb.Booster",
object <- xgb.Booster.check(object, saveraw = FALSE)
}
if (class(newdata) != "xgb.DMatrix") {
if (is.null(missing)) {
newdata <- xgb.DMatrix(newdata)
} else {
newdata <- xgb.DMatrix(newdata, missing = missing)
}
newdata <- xgb.DMatrix(newdata, missing = missing)
}
if (is.null(ntreelimit)) {
ntreelimit <- 0
@@ -52,7 +58,7 @@ setMethod("predict", signature = "xgb.Booster",
stop("predict: ntreelimit must be equal to or greater than 1")
}
}
option = 0
option <- 0
if (outputmargin) {
option <- option + 1
}
@@ -72,4 +78,3 @@ setMethod("predict", signature = "xgb.Booster",
}
return(ret)
})
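A short sketch of the two options documented above, assuming a small booster trained on the agaricus demo data (the same data the roxygen example uses); only the ntreelimit and predleaf arguments are of interest here, and the call is illustrative rather than taken from the package documentation.

    library(xgboost)
    data(agaricus.train, package = 'xgboost')
    data(agaricus.test, package = 'xgboost')
    bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
                   max.depth = 2, eta = 1, nrounds = 10, objective = "binary:logistic")
    pred.all   <- predict(bst, agaricus.test$data)                   # uses all 10 trees
    pred.first <- predict(bst, agaricus.test$data, ntreelimit = 2)   # only the first 2 trees
    leaf.index <- predict(bst, agaricus.test$data, predleaf = TRUE)  # one column of leaf indices per tree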


@@ -13,7 +13,6 @@ setMethod("predict", signature = "xgb.Booster.handle",
bst <- xgb.handleToBooster(object)
ret = predict(bst, ...)
ret <- predict(bst, ...)
return(ret)
})


@@ -21,7 +21,6 @@
#' stopifnot(all(labels2 == 1-labels))
#' @rdname setinfo
#' @export
#'
setinfo <- function(object, ...){
UseMethod("setinfo")
}


@@ -13,7 +13,6 @@ setClass('xgb.DMatrix')
#' dsub <- slice(dtrain, 1:3)
#' @rdname slice
#' @export
#'
slice <- function(object, ...){
UseMethod("slice")
}
@@ -34,8 +33,8 @@ setMethod("slice", signature = "xgb.DMatrix",
attr_list <- attributes(object)
nr <- xgb.numrow(object)
len <- sapply(attr_list,length)
ind <- which(len==nr)
if (length(ind)>0) {
ind <- which(len == nr)
if (length(ind) > 0) {
nms <- names(attr_list)[ind]
for (i in 1:length(ind)) {
attr(ret,nms[i]) <- attr(object,nms[i])[idxset]


@@ -1,4 +1,4 @@
#' @importClassesFrom Matrix dgCMatrix dgeMatrix
#' @importClassesFrom Matrix dgCMatrix dgeMatrix
#' @import methods
# depends on matrix
@@ -15,14 +15,14 @@ xgb.setinfo <- function(dmat, name, info) {
stop("xgb.setinfo: first argument dtrain must be xgb.DMatrix")
}
if (name == "label") {
if (length(info)!=xgb.numrow(dmat))
if (length(info) != xgb.numrow(dmat))
stop("The length of labels must equal to the number of rows in the input data")
.Call("XGDMatrixSetInfo_R", dmat, name, as.numeric(info),
PACKAGE = "xgboost")
return(TRUE)
}
if (name == "weight") {
if (length(info)!=xgb.numrow(dmat))
if (length(info) != xgb.numrow(dmat))
stop("The length of weights must equal to the number of rows in the input data")
.Call("XGDMatrixSetInfo_R", dmat, name, as.numeric(info),
PACKAGE = "xgboost")
@@ -36,7 +36,7 @@ xgb.setinfo <- function(dmat, name, info) {
return(TRUE)
}
if (name == "group") {
if (sum(info)!=xgb.numrow(dmat))
if (sum(info) != xgb.numrow(dmat))
stop("The sum of groups must equal to the number of rows in the input data")
.Call("XGDMatrixSetInfo_R", dmat, name, as.integer(info),
PACKAGE = "xgboost")
@@ -103,16 +103,15 @@ xgb.Booster.check <- function(bst, saveraw = TRUE)
## ----the following are low level iteratively function, not needed if
## you do not want to use them ---------------------------------------
# get dmatrix from data, label
xgb.get.DMatrix <- function(data, label = NULL, missing = NULL) {
xgb.get.DMatrix <- function(data, label = NULL, missing = NA, weight = NULL) {
inClass <- class(data)
if (inClass == "dgCMatrix" || inClass == "matrix") {
if (is.null(label)) {
stop("xgboost: need label when data is a matrix")
}
if (is.null(missing)){
dtrain <- xgb.DMatrix(data, label = label)
} else {
dtrain <- xgb.DMatrix(data, label = label, missing = missing)
dtrain <- xgb.DMatrix(data, label = label, missing = missing)
if (!is.null(weight)){
xgb.setinfo(dtrain, "weight", weight)
}
} else {
if (!is.null(label)) {
@@ -122,6 +121,9 @@ xgb.get.DMatrix <- function(data, label = NULL, missing = NULL) {
dtrain <- xgb.DMatrix(data)
} else if (inClass == "xgb.DMatrix") {
dtrain <- data
} else if (inClass == "data.frame") {
stop("xgboost only supports numerical matrix input,
use 'data.matrix' to transform the data.")
} else {
stop("xgboost: Invalid input of data")
}
@@ -140,8 +142,7 @@ xgb.iter.boost <- function(booster, dtrain, gpair) {
if (class(dtrain) != "xgb.DMatrix") {
stop("xgb.iter.update: second argument must be type xgb.DMatrix")
}
.Call("XGBoosterBoostOneIter_R", booster, dtrain, gpair$grad, gpair$hess,
PACKAGE = "xgboost")
.Call("XGBoosterBoostOneIter_R", booster, dtrain, gpair$grad, gpair$hess, PACKAGE = "xgboost")
return(TRUE)
}
@@ -157,7 +158,7 @@ xgb.iter.update <- function(booster, dtrain, iter, obj = NULL) {
if (is.null(obj)) {
.Call("XGBoosterUpdateOneIter_R", booster, as.integer(iter), dtrain,
PACKAGE = "xgboost")
} else {
} else {
pred <- predict(booster, dtrain)
gpair <- obj(pred, dtrain)
succ <- xgb.iter.boost(booster, dtrain, gpair)
@@ -220,7 +221,8 @@ xgb.cv.mknfold <- function(dall, nfold, param, stratified, folds) {
stop("nfold must be bigger than 1")
}
if(is.null(folds)) {
if (exists('objective', where=param) && strtrim(param[['objective']], 5) == 'rank:') {
if (exists('objective', where=param) && is.character(param$objective) &&
strtrim(param[['objective']], 5) == 'rank:') {
stop("\tAutomatic creation of CV-folds is not implemented for ranking!\n",
"\tConsider providing pre-computed CV-folds through the folds parameter.")
}
@@ -234,7 +236,7 @@ xgb.cv.mknfold <- function(dall, nfold, param, stratified, folds) {
# For classification, need to convert y labels to factor before making the folds,
# and then do stratification by factor levels.
# For regression, leave y numeric and do stratification by quantiles.
if (exists('objective', where=param)) {
if (exists('objective', where=param) && is.character(param$objective)) {
# If 'objective' provided in params, assume that y is a classification label
# unless objective is reg:linear
if (param[['objective']] != 'reg:linear') y <- factor(y)
@@ -249,17 +251,17 @@ xgb.cv.mknfold <- function(dall, nfold, param, stratified, folds) {
# make simple non-stratified folds
kstep <- length(randidx) %/% nfold
folds <- list()
for (i in 1:(nfold-1)) {
folds[[i]] = randidx[1:kstep]
randidx = setdiff(randidx, folds[[i]])
for (i in 1:(nfold - 1)) {
folds[[i]] <- randidx[1:kstep]
randidx <- setdiff(randidx, folds[[i]])
}
folds[[nfold]] = randidx
folds[[nfold]] <- randidx
}
}
ret <- list()
for (k in 1:nfold) {
dtest <- slice(dall, folds[[k]])
didx = c()
didx <- c()
for (i in 1:nfold) {
if (i != k) {
didx <- append(didx, folds[[i]])
@@ -267,7 +269,7 @@ xgb.cv.mknfold <- function(dall, nfold, param, stratified, folds) {
}
dtrain <- slice(dall, didx)
bst <- xgb.Booster(param, list(dtrain, dtest))
watchlist = list(train=dtrain, test=dtest)
watchlist <- list(train=dtrain, test=dtest)
ret[[k]] <- list(dtrain=dtrain, booster=bst, watchlist=watchlist, index=folds[[k]])
}
return (ret)
@@ -287,7 +289,7 @@ xgb.cv.aggcv <- function(res, showsd = TRUE) {
}
ret <- paste(ret, sprintf("%f", mean(stats)), sep="")
if (showsd) {
ret <- paste(ret, sprintf("+%f", sd(stats)), sep="")
ret <- paste(ret, sprintf("+%f", stats::sd(stats)), sep="")
}
}
return (ret)
@@ -308,11 +310,11 @@ xgb.createFolds <- function(y, k = 10)
## At most, we will use quantiles. If the sample
## is too small, we just do regular unstratified
## CV
cuts <- floor(length(y)/k)
if(cuts < 2) cuts <- 2
if(cuts > 5) cuts <- 5
cuts <- floor(length(y) / k)
if (cuts < 2) cuts <- 2
if (cuts > 5) cuts <- 5
y <- cut(y,
unique(quantile(y, probs = seq(0, 1, length = cuts))),
unique(stats::quantile(y, probs = seq(0, 1, length = cuts))),
include.lowest = TRUE)
}


@@ -17,8 +17,7 @@
#' xgb.DMatrix.save(dtrain, 'xgb.DMatrix.data')
#' dtrain <- xgb.DMatrix('xgb.DMatrix.data')
#' @export
#'
xgb.DMatrix <- function(data, info = list(), missing = 0, ...) {
xgb.DMatrix <- function(data, info = list(), missing = NA, ...) {
if (typeof(data) == "character") {
handle <- .Call("XGDMatrixCreateFromFile_R", data, as.integer(FALSE),
PACKAGE = "xgboost")
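A tiny sketch of what the new default means in practice; the matrix values below are made up purely for illustration.

    library(xgboost)
    m <- matrix(c(1, NA, 0, 2, 3, NA), nrow = 3)
    dtrain  <- xgb.DMatrix(m, label = c(0, 1, 1))               # NA entries are treated as missing by default
    dtrain0 <- xgb.DMatrix(m, label = c(0, 1, 1), missing = 0)  # previous behaviour: 0 marked the missing values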


@@ -12,7 +12,6 @@
#' xgb.DMatrix.save(dtrain, 'xgb.DMatrix.data')
#' dtrain <- xgb.DMatrix('xgb.DMatrix.data')
#' @export
#'
xgb.DMatrix.save <- function(DMatrix, fname) {
if (typeof(fname) != "character") {
stop("xgb.save: fname must be character")


@@ -0,0 +1,91 @@
#' Create new features from a previously learned model
#'
#' May improve the learning by adding new features to the training data based on the decision trees from a previously learned model.
#'
#' @importFrom magrittr %>%
#' @importFrom Matrix cBind
#' @importFrom Matrix sparse.model.matrix
#'
#' @param model decision tree boosting model learned on the original data
#' @param training.data original data (usually provided as a \code{dgCMatrix} matrix)
#'
#' @return \code{dgCMatrix} matrix including both the original data and the new features.
#'
#' @details
#' This is the function inspired from the paragraph 3.1 of the paper:
#'
#' \strong{Practical Lessons from Predicting Clicks on Ads at Facebook}
#'
#' \emph{(Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yan, xin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers,
#' Joaquin Quiñonero Candela)}
#'
#' International Workshop on Data Mining for Online Advertising (ADKDD) - August 24, 2014
#'
#' \url{https://research.facebook.com/publications/758569837499391/practical-lessons-from-predicting-clicks-on-ads-at-facebook/}.
#'
#' Extract explaining the method:
#'
#' "\emph{We found that boosted decision trees are a powerful and very
#' convenient way to implement non-linear and tuple transformations
#' of the kind we just described. We treat each individual
#' tree as a categorical feature that takes as value the
#' index of the leaf an instance ends up falling in. We use
#' 1-of-K coding of this type of features.
#'
#' For example, consider the boosted tree model in Figure 1 with 2 subtrees,
#' where the first subtree has 3 leafs and the second 2 leafs. If an
#' instance ends up in leaf 2 in the first subtree and leaf 1 in
#' second subtree, the overall input to the linear classifier will
#' be the binary vector \code{[0, 1, 0, 1, 0]}, where the first 3 entries
#' correspond to the leaves of the first subtree and last 2 to
#' those of the second subtree.
#'
#' [...]
#'
#' We can understand boosted decision tree
#' based transformation as a supervised feature encoding that
#' converts a real-valued vector into a compact binary-valued
#' vector. A traversal from root node to a leaf node represents
#' a rule on certain features.}"
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' data(agaricus.test, package='xgboost')
#' dtrain <- xgb.DMatrix(data = agaricus.train$data, label = agaricus.train$label)
#' dtest <- xgb.DMatrix(data = agaricus.test$data, label = agaricus.test$label)
#'
#' param <- list(max.depth=2, eta=1, silent=1, objective='binary:logistic')
#' nround = 4
#'
#' bst = xgb.train(params = param, data = dtrain, nrounds = nround, nthread = 2)
#'
#' # Model accuracy without new features
#' accuracy.before <- sum((predict(bst, agaricus.test$data) >= 0.5) == agaricus.test$label) / length(agaricus.test$label)
#'
#' # Convert previous features to one hot encoding
#' new.features.train <- xgb.create.features(model = bst, agaricus.train$data)
#' new.features.test <- xgb.create.features(model = bst, agaricus.test$data)
#'
#' # learning with new features
#' new.dtrain <- xgb.DMatrix(data = new.features.train, label = agaricus.train$label)
#' new.dtest <- xgb.DMatrix(data = new.features.test, label = agaricus.test$label)
#' watchlist <- list(train = new.dtrain)
#' bst <- xgb.train(params = param, data = new.dtrain, nrounds = nround, nthread = 2)
#'
#' # Model accuracy with new features
#' accuracy.after <- sum((predict(bst, new.dtest) >= 0.5) == agaricus.test$label) / length(agaricus.test$label)
#'
#' # Here the accuracy was already good and is now perfect.
#' cat(paste("The accuracy was", accuracy.before, "before adding leaf features and it is now", accuracy.after, "!\n"))
#'
#' @export
xgb.create.features <- function(model, training.data){
pred_with_leaf <- predict(model, training.data, predleaf = TRUE)
cols <- list()
for(i in 1:ncol(pred_with_leaf)){
# max is not the real max but it's not important for the purpose of adding features
leaf.id <- sort(unique(pred_with_leaf[,i]))
cols[[i]] <- factor(x = pred_with_leaf[,i], level = leaf.id)
}
cBind(training.data, sparse.model.matrix( ~ . -1, as.data.frame(cols)))
}


@@ -54,11 +54,11 @@
#' @param folds \code{list} provides a possibility of using a list of pre-defined CV folds (each element must be a vector of fold's indices).
#' If folds are supplied, the nfold and stratified parameters would be ignored.
#' @param verbose \code{boolean}, print the statistics during the process
#' @param early_stop_round If \code{NULL}, the early stopping function is not triggered.
#' @param print.every.n Print every N progress messages when \code{verbose>0}. Default is 1 which means all messages are printed.
#' @param early.stop.round If \code{NULL}, the early stopping function is not triggered.
#' If set to an integer \code{k}, training with a validation set will stop if the performance
#' keeps getting worse consecutively for \code{k} rounds.
#' @param early.stop.round An alternative of \code{early_stop_round}.
#' @param maximize If \code{feval} and \code{early_stop_round} are set, then \code{maximize} must be set as well.
#' @param maximize If \code{feval} and \code{early.stop.round} are set, then \code{maximize} must be set as well.
#' \code{maximize=TRUE} means the larger the evaluation score the better.
#'
#' @param ... other parameters to pass to \code{params}.
@@ -90,140 +90,158 @@
#' max.depth =3, eta = 1, objective = "binary:logistic")
#' print(history)
#' @export
#'
xgb.cv <- function(params=list(), data, nrounds, nfold, label = NULL, missing = NULL,
xgb.cv <- function(params=list(), data, nrounds, nfold, label = NULL, missing = NA,
prediction = FALSE, showsd = TRUE, metrics=list(),
obj = NULL, feval = NULL, stratified = TRUE, folds = NULL, verbose = T,
early_stop_round = NULL, early.stop.round = NULL, maximize = NULL, ...) {
if (typeof(params) != "list") {
stop("xgb.cv: first argument params must be list")
}
if(!is.null(folds)) {
if(class(folds)!="list" | length(folds) < 2) {
stop("folds must be a list with 2 or more elements that are vectors of indices for each CV-fold")
obj = NULL, feval = NULL, stratified = TRUE, folds = NULL, verbose = T, print.every.n=1L,
early.stop.round = NULL, maximize = NULL, ...) {
if (typeof(params) != "list") {
stop("xgb.cv: first argument params must be list")
}
if(!is.null(folds)) {
if(class(folds) != "list" | length(folds) < 2) {
stop("folds must be a list with 2 or more elements that are vectors of indices for each CV-fold")
}
nfold <- length(folds)
}
if (nfold <= 1) {
stop("nfold must be bigger than 1")
}
nfold <- length(folds)
}
if (nfold <= 1) {
stop("nfold must be bigger than 1")
}
if (is.null(missing)) {
dtrain <- xgb.get.DMatrix(data, label)
} else {
dtrain <- xgb.get.DMatrix(data, label, missing)
}
params <- append(params, list(...))
params <- append(params, list(silent=1))
for (mc in metrics) {
params <- append(params, list("eval_metric"=mc))
}
# Early Stopping
if (is.null(early_stop_round) && !is.null(early.stop.round))
early_stop_round = early.stop.round
if (!is.null(early_stop_round)){
if (!is.null(feval) && is.null(maximize))
stop('Please set maximize to note whether the model is maximizing the evaluation or not.')
if (is.null(maximize) && is.null(params$eval_metric))
stop('Please set maximize to note whether the model is maximizing the evaluation or not.')
if (is.null(maximize))
{
if (params$eval_metric %in% c('rmse','logloss','error','merror','mlogloss')) {
maximize = FALSE
} else {
maximize = TRUE
}
dot.params <- list(...)
nms.params <- names(params)
nms.dot.params <- names(dot.params)
if (length(intersect(nms.params,nms.dot.params)) > 0)
stop("Duplicated defined term in parameters. Please check your list of params.")
params <- append(params, dot.params)
params <- append(params, list(silent=1))
for (mc in metrics) {
params <- append(params, list("eval_metric"=mc))
}
if (maximize) {
bestScore = 0
} else {
bestScore = Inf
}
bestInd = 0
earlyStopflag = FALSE
# customized objective and evaluation metric interface
if (!is.null(params$objective) && !is.null(obj))
stop("xgb.cv: cannot assign two different objectives")
if (!is.null(params$objective))
if (class(params$objective) == 'function') {
obj <- params$objective
params[['objective']] <- NULL
}
# if (!is.null(params$eval_metric) && !is.null(feval))
# stop("xgb.cv: cannot assign two different evaluation metrics")
if (!is.null(params$eval_metric))
if (class(params$eval_metric) == 'function') {
feval <- params$eval_metric
params[['eval_metric']] <- NULL
}
if (length(metrics)>1)
warning('Only the first metric is used for early stopping process.')
}
# Early Stopping
if (!is.null(early.stop.round)){
if (!is.null(feval) && is.null(maximize))
stop('Please set maximize to note whether the model is maximizing the evaluation or not.')
if (is.null(maximize) && is.null(params$eval_metric))
stop('Please set maximize to note whether the model is maximizing the evaluation or not.')
if (is.null(maximize))
{
if (params$eval_metric %in% c('rmse','logloss','error','merror','mlogloss')) {
maximize <- FALSE
} else {
maximize <- TRUE
}
}
xgb_folds <- xgb.cv.mknfold(dtrain, nfold, params, stratified, folds)
obj_type = params[['objective']]
mat_pred = FALSE
if (!is.null(obj_type) && obj_type=='multi:softprob')
{
num_class = params[['num_class']]
if (is.null(num_class))
stop('must set num_class to use softmax')
predictValues <- matrix(0,xgb.numrow(dtrain),num_class)
mat_pred = TRUE
}
else
predictValues <- rep(0,xgb.numrow(dtrain))
history <- c()
for (i in 1:nrounds) {
msg <- list()
for (k in 1:nfold) {
fd <- xgb_folds[[k]]
succ <- xgb.iter.update(fd$booster, fd$dtrain, i - 1, obj)
if (i<nrounds) {
msg[[k]] <- xgb.iter.eval(fd$booster, fd$watchlist, i - 1, feval) %>% str_split("\t") %>% .[[1]]
} else {
if (!prediction) {
msg[[k]] <- xgb.iter.eval(fd$booster, fd$watchlist, i - 1, feval) %>% str_split("\t") %>% .[[1]]
if (maximize) {
bestScore <- 0
} else {
res <- xgb.iter.eval(fd$booster, fd$watchlist, i - 1, feval, prediction)
if (mat_pred) {
pred_mat = matrix(res[[2]],num_class,length(fd$index))
predictValues[fd$index,] <- t(pred_mat)
} else {
predictValues[fd$index] <- res[[2]]
}
msg[[k]] <- res[[1]] %>% str_split("\t") %>% .[[1]]
bestScore <- Inf
}
}
}
ret <- xgb.cv.aggcv(msg, showsd)
history <- c(history, ret)
if(verbose) paste(ret, "\n", sep="") %>% cat
bestInd <- 0
earlyStopflag <- FALSE
# early_Stopping
if (!is.null(early_stop_round)){
score = strsplit(ret,'\\s+')[[1]][1+length(metrics)+1]
score = strsplit(score,'\\+|:')[[1]][[2]]
score = as.numeric(score)
if ((maximize && score>bestScore) || (!maximize && score<bestScore)) {
bestScore = score
bestInd = i
} else {
if (i-bestInd>=early_stop_round) {
earlyStopflag = TRUE
cat('Stopping. Best iteration:',bestInd)
break
}
}
if (length(metrics) > 1)
warning('Only the first metric is used for early stopping process.')
}
}
xgb_folds <- xgb.cv.mknfold(dtrain, nfold, params, stratified, folds)
obj_type <- params[['objective']]
mat_pred <- FALSE
if (!is.null(obj_type) && obj_type == 'multi:softprob')
{
num_class <- params[['num_class']]
if (is.null(num_class))
stop('must set num_class to use softmax')
predictValues <- matrix(0,xgb.numrow(dtrain),num_class)
mat_pred <- TRUE
}
else
predictValues <- rep(0,xgb.numrow(dtrain))
history <- c()
print.every.n <- max(as.integer(print.every.n), 1L)
for (i in 1:nrounds) {
msg <- list()
for (k in 1:nfold) {
fd <- xgb_folds[[k]]
succ <- xgb.iter.update(fd$booster, fd$dtrain, i - 1, obj)
msg[[k]] <- xgb.iter.eval(fd$booster, fd$watchlist, i - 1, feval) %>% str_split("\t") %>% .[[1]]
}
ret <- xgb.cv.aggcv(msg, showsd)
history <- c(history, ret)
if(verbose)
if (0 == (i - 1L) %% print.every.n)
cat(ret, "\n", sep="")
colnames <- str_split(string = history[1], pattern = "\t")[[1]] %>% .[2:length(.)] %>% str_extract(".*:") %>% str_replace(":","") %>% str_replace("-", ".")
colnamesMean <- paste(colnames, "mean")
if(showsd) colnamesStd <- paste(colnames, "std")
# early_Stopping
if (!is.null(early.stop.round)){
score <- strsplit(ret,'\\s+')[[1]][2 + length(metrics)]
score <- strsplit(score,'\\+|:')[[1]][[2]]
score <- as.numeric(score)
if ( (maximize && score > bestScore) || (!maximize && score < bestScore)) {
bestScore <- score
bestInd <- i
} else {
if (i - bestInd >= early.stop.round) {
earlyStopflag <- TRUE
cat('Stopping. Best iteration:', bestInd, '\n')
break
}
}
}
}
colnames <- c()
if(showsd) for(i in 1:length(colnamesMean)) colnames <- c(colnames, colnamesMean[i], colnamesStd[i])
else colnames <- colnamesMean
if (prediction) {
for (k in 1:nfold) {
fd <- xgb_folds[[k]]
if (!is.null(early.stop.round) && earlyStopflag) {
res <- xgb.iter.eval(fd$booster, fd$watchlist, bestInd - 1, feval, prediction)
} else {
res <- xgb.iter.eval(fd$booster, fd$watchlist, nrounds - 1, feval, prediction)
}
if (mat_pred) {
pred_mat <- matrix(res[[2]],num_class,length(fd$index))
predictValues[fd$index,] <- t(pred_mat)
} else {
predictValues[fd$index] <- res[[2]]
}
}
}
type <- rep(x = "numeric", times = length(colnames))
dt <- read.table(text = "", colClasses = type, col.names = colnames) %>% as.data.table
split <- str_split(string = history, pattern = "\t")
colnames <- str_split(string = history[1], pattern = "\t")[[1]] %>% .[2:length(.)] %>% str_extract(".*:") %>% str_replace(":","") %>% str_replace("-", ".")
colnamesMean <- paste(colnames, "mean")
if(showsd) colnamesStd <- paste(colnames, "std")
for(line in split) dt <- line[2:length(line)] %>% str_extract_all(pattern = "\\d*\\.+\\d*") %>% unlist %>% as.numeric %>% as.list %>% {rbindlist(list(dt, .), use.names = F, fill = F)}
colnames <- c()
if(showsd) for(i in 1:length(colnamesMean)) colnames <- c(colnames, colnamesMean[i], colnamesStd[i])
else colnames <- colnamesMean
if (prediction) {
return(list(dt = dt,pred = predictValues))
}
return(dt)
type <- rep(x = "numeric", times = length(colnames))
dt <- utils::read.table(text = "", colClasses = type, col.names = colnames) %>% as.data.table
split <- str_split(string = history, pattern = "\t")
for(line in split) dt <- line[2:length(line)] %>% str_extract_all(pattern = "\\d*\\.+\\d*") %>% unlist %>% as.numeric %>% as.list %>% {rbindlist( list( dt, .), use.names = F, fill = F)}
if (prediction) {
return( list( dt = dt,pred = predictValues))
}
return(dt)
}
# Avoid error messages during CRAN check.
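To round off the xgb.cv changes above, here is a hypothetical sketch of the two new conveniences, print.every.n and the prediction flag that returns the out-of-fold predictions; it again assumes the agaricus demo data and is not taken from the package documentation.

    library(xgboost)
    data(agaricus.train, package = 'xgboost')
    dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
    res <- xgb.cv(params = list(objective = "binary:logistic", max.depth = 3, eta = 1),
                  data = dtrain, nrounds = 20, nfold = 5,
                  print.every.n = 5,   # print every 5th evaluation message instead of every round
                  prediction = TRUE)   # also return the out-of-fold predictions
    head(res$pred)                     # res$dt holds the per-round evaluation history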

View File

@@ -36,7 +36,6 @@
#' # print the model without saving it to a file
#' print(xgb.dump(bst))
#' @export
#'
xgb.dump <- function(model = NULL, fname = NULL, fmap = "", with.stats=FALSE) {
if (class(model) != "xgb.Booster") {
stop("model: argument must be type xgb.Booster")

View File

@@ -1,7 +1,6 @@
#' Show importance of features in a model
#'
#' Read a xgboost model text dump.
#' Can be tree or linear model (text dump of linear model are only supported in dev version of \code{Xgboost} for now).
#' Create a \code{data.table} of the most important features of a model.
#'
#' @importFrom data.table data.table
#' @importFrom data.table setnames
@@ -11,34 +10,30 @@
#' @importFrom Matrix cBind
#' @importFrom Matrix sparseVector
#'
#' @param feature_names names of each feature as a character vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.
#'
#' @param filename_dump the path to the text file storing the model. Model dump must include the gain per feature and per tree (\code{with.stats = T} in function \code{xgb.dump}).
#'
#' @param model generated by the \code{xgb.train} function. Avoid the creation of a dump file.
#'
#' @param feature_names names of each feature as a \code{character} vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.
#' @param model generated by the \code{xgb.train} function.
#' @param data the dataset used for the training step. Will be used with the \code{label} parameter for co-occurrence computation. More information in the \code{Details} part. This parameter is optional.
#'
#' @param label the label vector used for the training step. Will be used with the \code{data} parameter for co-occurrence computation. More information in the \code{Details} part. This parameter is optional.
#'
#' @param target a function which returns \code{TRUE} or \code{1} when an observation should be counted as a co-occurrence and \code{FALSE} or \code{0} otherwise. A default function is provided for computing co-occurrences in a binary classification. The \code{target} function should have only one parameter. This parameter will be used to provide each important feature vector after the split condition has been applied, therefore these vectors will be made of 0 and 1 only, whatever the information was before. More information in the \code{Details} part. This parameter is optional.
#'
#' @return A \code{data.table} of the features used in the model with their average gain (and their weight for boosted tree models).
#'
#' @details
#' This is the function to use to understand the trained model (and, through the model, your data).
#'
#' Results are returned for both linear and tree models.
#' This function is for both linear and tree models.
#'
#' \code{data.table} is returned by the function.
#' There are 3 columns :
#' The columns are:
#' \itemize{
#' \item \code{Features} name of the features as provided in \code{feature_names} or already present in the model dump.
#' \item \code{Gain} contribution of each feature to the model. For boosted tree model, each gain of each feature of each tree is taken into account, then average per feature to give a vision of the entire model. Highest percentage means important feature to predict the \code{label} used for the training ;
#' \item \code{Cover} metric of the number of observation related to this feature (only available for tree models) ;
#' \item \code{Weight} percentage representing the relative number of times a feature have been taken into trees. \code{Gain} should be prefered to search the most important feature. For boosted linear model, this column has no meaning.
#' \item \code{Features} name of the features as provided in \code{feature_names} or already present in the model dump;
#' \item \code{Gain} contribution of each feature to the model. For boosted tree models, the gain of each feature in each tree is taken into account, then averaged per feature to give a vision of the entire model. The highest percentage means the most important feature for predicting the \code{label} used for the training (only available for tree models);
#' \item \code{Cover} metric of the number of observations related to this feature (only available for tree models);
#' \item \code{Weight} percentage representing the relative number of times a feature has been used in the trees.
#' }
#'
#' If you don't provide \code{feature_names}, the index of the features will be used instead.
#'
#' Because the index is extracted from the model dump (made on the C++ side), it starts at 0 (as usual in C++) instead of 1 (as usual in R).
#'
#' Co-occurrence count
#' -------------------
#'
@@ -51,35 +46,26 @@
#' @examples
#' data(agaricus.train, package='xgboost')
#'
#' # Both dataset are list with two items, a sparse matrix and labels
#' # (labels = outcome column which will be learned).
#' # Each column of the sparse Matrix is a feature in one hot encoding format.
#' train <- agaricus.train
#'
#' bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
#' bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 2,
#' eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
#'
#' # train$data@@Dimnames[[2]] represents the column names of the sparse matrix.
#' xgb.importance(train$data@@Dimnames[[2]], model = bst)
#' # agaricus.train$data@@Dimnames[[2]] represents the column names of the sparse matrix.
#' xgb.importance(agaricus.train$data@@Dimnames[[2]], model = bst)
#'
#' # Same thing with co-occurrence computation this time
#' xgb.importance(train$data@@Dimnames[[2]], model = bst, data = train$data, label = train$label)
#' xgb.importance(agaricus.train$data@@Dimnames[[2]], model = bst, data = agaricus.train$data, label = agaricus.train$label)
#'
#' @export
xgb.importance <- function(feature_names = NULL, filename_dump = NULL, model = NULL, data = NULL, label = NULL, target = function(x) ((x + label) == 2)){
xgb.importance <- function(feature_names = NULL, model = NULL, data = NULL, label = NULL, target = function(x) ( (x + label) == 2)){
if (!class(feature_names) %in% c("character", "NULL")) {
stop("feature_names: Has to be a vector of character or NULL if the model dump already contains feature name. Look at this function documentation to see where to get feature names.")
stop("feature_names: Has to be a vector of character or NULL if the model already contains feature name. Look at this function documentation to see where to get feature names.")
}
if (!(class(filename_dump) %in% c("character", "NULL") && length(filename_dump) <= 1)) {
stop("filename_dump: Has to be a path to the model dump file.")
}
if (!class(model) %in% c("xgb.Booster", "NULL")) {
if (class(model) != "xgb.Booster") {
stop("model: Has to be an object of class xgb.Booster model generaged by the xgb.train function.")
}
if((is.null(data) & !is.null(label)) |(!is.null(data) & is.null(label))) {
if((is.null(data) & !is.null(label)) | (!is.null(data) & is.null(label))) {
stop("data/label: Provide the two arguments if you want co-occurence computation or none of them if you are not interested but not one of them only.")
}
@@ -87,17 +73,24 @@ xgb.importance <- function(feature_names = NULL, filename_dump = NULL, model = N
if(sum(label == 0) / length(label) > 0.5) label <- as(label, "sparseVector")
}
if(is.null(model)){
text <- readLines(filename_dump)
} else {
text <- xgb.dump(model = model, with.stats = T)
treeDump <- function(feature_names, text, keepDetail){
if(keepDetail) groupBy <- c("Feature", "Split", "MissingNo") else groupBy <- "Feature"
xgb.model.dt.tree(feature_names = feature_names, text = text)[,"MissingNo" := Missing == No ][Feature != "Leaf",.(Gain = sum(Quality), Cover = sum(Cover), Frequency = .N), by = groupBy, with = T][,`:=`(Gain = Gain / sum(Gain), Cover = Cover / sum(Cover), Frequency = Frequency / sum(Frequency))][order(Gain, decreasing = T)]
}
if(text[2] == "bias:"){
result <- readLines(filename_dump) %>% linearDump(feature_names, .)
linearDump <- function(feature_names, text){
weights <- which(text == "weight:") %>% {a =. + 1; text[a:length(text)]} %>% as.numeric
if(is.null(feature_names)) feature_names <- seq(to = length(weights))
data.table(Feature = feature_names, Weight = weights)
}
model.text.dump <- xgb.dump(model = model, with.stats = T)
if(model.text.dump[2] == "bias:"){
result <- model.text.dump %>% linearDump(feature_names, .)
if(!is.null(data) | !is.null(label)) warning("data/label: these parameters should only be provided with decision tree based models.")
} else {
result <- treeDump(feature_names, text = text, keepDetail = !is.null(data))
result <- treeDump(feature_names, text = model.text.dump, keepDetail = !is.null(data))
# Co-occurrence computation
if(!is.null(data) & !is.null(label) & nrow(result) > 0) {
@@ -110,24 +103,12 @@ xgb.importance <- function(feature_names = NULL, filename_dump = NULL, model = N
d <- data[, result[,Feature], drop=FALSE] < as.numeric(result[,Split])
apply(c & d, 2, . %>% target %>% sum) -> vec
result <- result[, "RealCover":= as.numeric(vec), with = F][, "RealCover %" := RealCover / sum(label)][,MissingNo:=NULL]
result <- result[, "RealCover" := as.numeric(vec), with = F][, "RealCover %" := RealCover / sum(label)][,MissingNo := NULL]
}
}
result
}
treeDump <- function(feature_names, text, keepDetail){
if(keepDetail) groupBy <- c("Feature", "Split", "MissingNo") else groupBy <- "Feature"
result <- xgb.model.dt.tree(feature_names = feature_names, text = text)[,"MissingNo":= Missing == No ][Feature!="Leaf",.(Gain = sum(Quality), Cover = sum(Cover), Frequence = .N), by = groupBy, with = T][,`:=`(Gain = Gain/sum(Gain), Cover = Cover/sum(Cover), Frequence = Frequence/sum(Frequence))][order(Gain, decreasing = T)]
result
}
linearDump <- function(feature_names, text){
which(text == "weight:") %>% {a=.+1;text[a:length(text)]} %>% as.numeric %>% data.table(Feature = feature_names, Weight = .)
}
# Avoid error messages during CRAN check.
# The reason is that these variables are never declared
# They are mainly column names inferred by Data.table...
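A short sketch of the co-occurrence branch above, assuming the agaricus demo data and a small tree model as in the roxygen example; when `data` and `label` are supplied, the result gains the `RealCover` and `RealCover %` columns computed in the code above.
```r
library(xgboost)
data(agaricus.train, package = 'xgboost')
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               max.depth = 2, eta = 1, nthread = 2, nrounds = 2,
               objective = "binary:logistic")
imp <- xgb.importance(agaricus.train$data@Dimnames[[2]], model = bst,
                      data = agaricus.train$data, label = agaricus.train$label)
# RealCover / "RealCover %" count the observations matching both the split condition
# and the label, as done by the default target function
head(imp)
```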

View File

@@ -15,7 +15,6 @@
#' bst <- xgb.load('xgb.model')
#' pred <- predict(bst, test$data)
#' @export
#'
xgb.load <- function(modelfile) {
if (is.null(modelfile))
stop("xgb.load: modelfile cannot be NULL")

View File

@@ -1,6 +1,6 @@
#' Convert tree model dump to data.table
#' Parse boosted tree model text dump
#'
#' Read a tree model text dump and return a data.table.
#' Parse a boosted tree model text dump and return a \code{data.table}.
#'
#' @importFrom data.table data.table
#' @importFrom data.table set
@@ -12,20 +12,20 @@
#' @importFrom magrittr add
#' @importFrom stringr str_extract
#' @importFrom stringr str_split
#' @importFrom stringr str_extract
#' @importFrom stringr str_trim
#' @param feature_names names of each feature as a character vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.
#' @param filename_dump the path to the text file storing the model. Model dump must include the gain per feature and per tree (parameter \code{with.stats = T} in function \code{xgb.dump}).
#' @param model dump generated by the \code{xgb.train} function. Avoid the creation of a dump file.
#' @param text dump generated by the \code{xgb.dump} function. Avoid the creation of a dump file. Model dump must include the gain per feature and per tree (parameter \code{with.stats = T} in function \code{xgb.dump}).
#' @param n_first_tree limit the plot to the n first trees. If \code{NULL}, all trees of the model are plotted. Performance can be low for huge models.
#' @param feature_names names of each feature as a character vector. Can be extracted from a sparse matrix (see example). If the model already contains feature names, this argument should be \code{NULL} (default value).
#' @param model object created by the \code{xgb.train} function.
#' @param text \code{character} vector generated by the \code{xgb.dump} function. Model dump must include the gain per feature and per tree (parameter \code{with.stats = TRUE} in function \code{xgb.dump}).
#' @param n_first_tree limit the parsing to the \code{n} first trees. If set to \code{NULL}, all trees of the model are parsed. Performance can be low depending on the size of the model.
#'
#' @return A \code{data.table} of the features used in the model with their gain, cover and few other thing.
#' @return A \code{data.table} of the features used in the model with their gain, cover and a few other pieces of information.
#'
#' @details
#' General function to convert a text dump of tree model to a Matrix. The purpose is to help user to explore the model and get a better understanding of it.
#' General function to convert a text dump of a tree model to a \code{data.table}.
#'
#' The content of the \code{data.table} is organised that way:
#' The purpose is to help the user explore the model and get a better understanding of it.
#'
#' The columns of the \code{data.table} are:
#'
#' \itemize{
#' \item \code{ID}: unique identifier of a node;
@@ -37,56 +37,40 @@
#' \item \code{Quality}: the gain related to the split in this specific node;
#' \item \code{Cover}: metric to measure the number of observations affected by the split;
#' \item \code{Tree}: ID of the tree. It is included in the main ID;
#' \item \code{Yes.X} or \code{No.X}: data related to the pointer in \code{Yes} or \code{No} column ;
#' \item \code{Yes.Feature}, \code{No.Feature}, \code{Yes.Cover}, \code{No.Cover}, \code{Yes.Quality} and \code{No.Quality}: data related to the pointer in the \code{Yes} or \code{No} column;
#' }
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#'
#' #Both dataset are list with two items, a sparse matrix and labels
#' #(labels = outcome column which will be learned).
#' #Each column of the sparse Matrix is a feature in one hot encoding format.
#' train <- agaricus.train
#'
#' bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
#' bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 2,
#' eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
#'
#' #agaricus.test$data@@Dimnames[[2]] represents the column names of the sparse matrix.
#' xgb.model.dt.tree(agaricus.train$data@@Dimnames[[2]], model = bst)
#' # agaricus.train$data@@Dimnames[[2]] represents the column names of the sparse matrix.
#' xgb.model.dt.tree(feature_names = agaricus.train$data@@Dimnames[[2]], model = bst)
#'
#' @export
xgb.model.dt.tree <- function(feature_names = NULL, filename_dump = NULL, model = NULL, text = NULL, n_first_tree = NULL){
xgb.model.dt.tree <- function(feature_names = NULL, model = NULL, text = NULL, n_first_tree = NULL){
if (!class(feature_names) %in% c("character", "NULL")) {
stop("feature_names: Has to be a vector of character or NULL if the model dump already contains feature name. Look at this function documentation to see where to get feature names.")
}
if (!(class(filename_dump) %in% c("character", "NULL") && length(filename_dump) <= 1)) {
stop("filename_dump: Has to be a character vector of size 1 representing the path to the model dump file.")
} else if (!is.null(filename_dump) && !file.exists(filename_dump)) {
stop("filename_dump: path to the model doesn't exist.")
} else if(is.null(filename_dump) && is.null(model) && is.null(text)){
stop("filename_dump & model & text: no path to dump model, no model, no text dump, have been provided.")
}
if (!class(model) %in% c("xgb.Booster", "NULL")) {
stop("model: Has to be an object of class xgb.Booster model generaged by the xgb.train function.")
}
if (!class(text) %in% c("character", "NULL")) {
stop("text: Has to be a vector of character or NULL if a path to the model dump has already been provided.")
if (class(model) != "xgb.Booster" & class(text) != "character") {
"model: Has to be an object of class xgb.Booster model generaged by the xgb.train function.\n" %>%
paste0("text: Has to be a vector of character or NULL if a path to the model dump has already been provided.") %>%
stop()
}
if (!class(n_first_tree) %in% c("numeric", "NULL") | length(n_first_tree) > 1) {
stop("n_first_tree: Has to be a numeric vector of size 1.")
}
if(!is.null(model)){
text = xgb.dump(model = model, with.stats = T)
} else if(!is.null(filename_dump)){
text <- readLines(filename_dump) %>% str_trim(side = "both")
if(is.null(text)){
text <- xgb.dump(model = model, with.stats = T)
}
position <- str_match(text, "booster") %>% is.na %>% not %>% which %>% c(length(text)+1)
position <- str_match(text, "booster") %>% is.na %>% not %>% which %>% c(length(text) + 1)
extract <- function(x, pattern) str_extract(x, pattern) %>% str_split("=") %>% lapply(function(x) x[2] %>% as.numeric) %>% unlist
@@ -96,15 +80,15 @@ xgb.model.dt.tree <- function(feature_names = NULL, filename_dump = NULL, model
allTrees <- data.table()
anynumber_regex<-"[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?"
for(i in 1:n_round){
anynumber_regex <- "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?"
for (i in 1:n_round){
tree <- text[(position[i]+1):(position[i+1]-1)]
tree <- text[(position[i] + 1):(position[i + 1] - 1)]
# avoid tree made of a leaf only (no split)
if(length(tree) <2) next
if(length(tree) < 2) next
treeID <- i-1
treeID <- i - 1
notLeaf <- str_match(tree, "leaf") %>% is.na
leaf <- notLeaf %>% not %>% tree[.]
@@ -128,38 +112,37 @@ xgb.model.dt.tree <- function(feature_names = NULL, filename_dump = NULL, model
qualityLeaf <- extract(leaf, paste0("leaf=",anynumber_regex))
coverBranch <- extract(branch, "cover=\\d*\\.*\\d*")
coverLeaf <- extract(leaf, "cover=\\d*\\.*\\d*")
dt <- data.table(ID = c(idBranch, idLeaf), Feature = c(featureBranch, featureLeaf), Split = c(splitBranch, splitLeaf), Yes = c(yesBranch, yesLeaf), No = c(noBranch, noLeaf), Missing = c(missingBranch, missingLeaf), Quality = c(qualityBranch, qualityLeaf), Cover = c(coverBranch, coverLeaf))[order(ID)][,Tree:=treeID]
dt <- data.table(ID = c(idBranch, idLeaf), Feature = c(featureBranch, featureLeaf), Split = c(splitBranch, splitLeaf), Yes = c(yesBranch, yesLeaf), No = c(noBranch, noLeaf), Missing = c(missingBranch, missingLeaf), Quality = c(qualityBranch, qualityLeaf), Cover = c(coverBranch, coverLeaf))[order(ID)][,Tree := treeID]
allTrees <- rbindlist(list(allTrees, dt), use.names = T, fill = F)
}
yes <- allTrees[!is.na(Yes),Yes]
yes <- allTrees[!is.na(Yes), Yes]
set(allTrees, i = which(allTrees[,Feature]!= "Leaf"),
set(allTrees, i = which(allTrees[, Feature] != "Leaf"),
j = "Yes.Feature",
value = allTrees[ID == yes,Feature])
value = allTrees[ID %in% yes, Feature])
set(allTrees, i = which(allTrees[,Feature]!= "Leaf"),
set(allTrees, i = which(allTrees[, Feature] != "Leaf"),
j = "Yes.Cover",
value = allTrees[ID == yes,Cover])
value = allTrees[ID %in% yes, Cover])
set(allTrees, i = which(allTrees[,Feature]!= "Leaf"),
j = "Yes.Quality",
value = allTrees[ID == yes,Quality])
set(allTrees, i = which(allTrees[, Feature] != "Leaf"),
j = "Yes.Quality",
value = allTrees[ID %in% yes, Quality])
no <- allTrees[!is.na(No), No]
no <- allTrees[!is.na(No),No]
set(allTrees, i = which(allTrees[,Feature]!= "Leaf"),
set(allTrees, i = which(allTrees[, Feature] != "Leaf"),
j = "No.Feature",
value = allTrees[ID == no,Feature])
value = allTrees[ID %in% no, Feature])
set(allTrees, i = which(allTrees[,Feature]!= "Leaf"),
set(allTrees, i = which(allTrees[, Feature] != "Leaf"),
j = "No.Cover",
value = allTrees[ID == no,Cover])
value = allTrees[ID %in% no, Cover])
set(allTrees, i = which(allTrees[,Feature]!= "Leaf"),
set(allTrees, i = which(allTrees[, Feature] != "Leaf"),
j = "No.Quality",
value = allTrees[ID == no,Quality])
value = allTrees[ID %in% no, Quality])
allTrees
}
@@ -167,4 +150,4 @@ xgb.model.dt.tree <- function(feature_names = NULL, filename_dump = NULL, model
# Avoid error messages during CRAN check.
# The reason is that these variables are never declared
# They are mainly column names inferred by Data.table...
globalVariables(c("ID", "Tree", "Yes", ".", ".N", "Feature", "Cover", "Quality", "No", "Gain", "Frequence"))
globalVariables(c("ID", "Tree", "Yes", ".", ".N", "Feature", "Cover", "Quality", "No", "Gain", "Frequency"))

View File

@@ -0,0 +1,160 @@
#' Plot multiple graphs at the same time
#'
#' Plot multiple graphs aligned by rows and columns.
#'
#' @importFrom data.table data.table
#' @param cols number of columns
#' @return NULL
multiplot <- function(..., cols = 1) {
plots <- list(...)
numPlots = length(plots)
layout <- matrix(seq(1, cols * ceiling(numPlots / cols)),
ncol = cols, nrow = ceiling(numPlots / cols))
if (numPlots == 1) {
print(plots[[1]])
} else {
grid::grid.newpage()
grid::pushViewport(grid::viewport(layout = grid::grid.layout(nrow(layout), ncol(layout))))
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.table(which(layout == i, arr.ind = TRUE))
print(
plots[[i]], vp = grid::viewport(
layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col
)
)
}
}
}
#' Parse the graph to extract vector of edges
#' @param element igraph object containing the path from the root to the leaf.
edge.parser <- function(element) {
edges.vector <- igraph::as_ids(element)
t <- tail(edges.vector, n = 1)
l <- length(edges.vector)
list(t,l)
}
#' Extract path from root to leaf from data.table
#' @param dt.tree data.table containing the nodes and edges of the trees
get.paths.to.leaf <- function(dt.tree) {
dt.not.leaf.edges <-
dt.tree[Feature != "Leaf",.(ID, Yes, Tree)] %>% list(dt.tree[Feature != "Leaf",.(ID, No, Tree)]) %>% rbindlist(use.names = F)
trees <- dt.tree[,unique(Tree)]
paths <- list()
for (tree in trees) {
graph <-
igraph::graph_from_data_frame(dt.not.leaf.edges[Tree == tree])
paths.tmp <-
igraph::shortest_paths(graph, from = paste0(tree, "-0"), to = dt.tree[Tree == tree &
Feature == "Leaf", c(ID)])
paths <- c(paths, paths.tmp$vpath)
}
paths
}
#' Plot model trees deepness
#'
#' Generate a graph to plot the distribution of deepness among trees.
#'
#' @importFrom data.table data.table
#' @importFrom data.table rbindlist
#' @importFrom data.table setnames
#' @importFrom data.table :=
#' @importFrom magrittr %>%
#' @param model object created by the \code{xgb.train} function.
#'
#' @return Two graphs showing the distribution of the model deepness.
#'
#' @details
#' Display both the number of \code{leaf} nodes and the distribution of \code{weighted observations}
#' by tree deepness level.
#'
#' The purpose of this function is to help the user find the best values for
#' the \code{max.depth} and \code{min_child_weight} parameters according to the bias / variance trade-off.
#'
#' See \link{xgb.train} for more information about these parameters.
#'
#' The graph is made of two parts:
#'
#' \itemize{
#' \item Count: number of leaves per level of deepness;
#' \item Weighted cover: normalized weighted cover per leaf (weighted number of instances).
#' }
#'
#' This function is inspired by the blog post \url{http://aysent.github.io/2015/11/08/random-forest-leaf-visualization.html}
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#'
#' bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 15,
#' eta = 1, nthread = 2, nround = 30, objective = "binary:logistic",
#' min_child_weight = 50)
#'
#' xgb.plot.deepness(model = bst)
#'
#' @export
xgb.plot.deepness <- function(model = NULL) {
if (!requireNamespace("ggplot2", quietly = TRUE)) {
stop("ggplot2 package is required for plotting the graph deepness.",
call. = FALSE)
}
if (!requireNamespace("igraph", quietly = TRUE)) {
stop("igraph package is required for plotting the graph deepness.",
call. = FALSE)
}
if (!requireNamespace("grid", quietly = TRUE)) {
stop("grid package is required for plotting the graph deepness.",
call. = FALSE)
}
if (class(model) != "xgb.Booster") {
stop("model: Has to be an object of class xgb.Booster model generaged by the xgb.train function.")
}
dt.tree <- xgb.model.dt.tree(model = model)
dt.edge.elements <- data.table()
paths <- get.paths.to.leaf(dt.tree)
dt.edge.elements <-
lapply(paths, edge.parser) %>% rbindlist %>% setnames(c("last.edge", "size")) %>%
merge(dt.tree, by.x = "last.edge", by.y = "ID") %>% rbind(dt.edge.elements)
dt.edge.summuize <-
dt.edge.elements[, .(.N, Cover = sum(Cover)), size][,Cover:= Cover / sum(Cover)]
p1 <-
ggplot2::ggplot(dt.edge.summuize) + ggplot2::geom_line(ggplot2::aes(x = size, y = N, group = 1)) +
ggplot2::xlab("") + ggplot2::ylab("Count") + ggplot2::ggtitle("Model complexity") +
ggplot2::theme(
plot.title = ggplot2::element_text(lineheight = 0.9, face = "bold"),
panel.grid.major.y = ggplot2::element_blank(),
axis.ticks = ggplot2::element_blank(),
axis.text.x = ggplot2::element_blank()
)
p2 <-
ggplot2::ggplot(dt.edge.summuize) + ggplot2::geom_line(ggplot2::aes(x =size, y = Cover, group = 1)) +
ggplot2::xlab("From root to leaf path length") + ggplot2::ylab("Weighted cover")
multiplot(p1,p2,cols = 1)
}
# Avoid error messages during CRAN check.
# The reason is that these variables are never declared
# They are mainly column names inferred by Data.table...
globalVariables(
c(
"Feature", "Count", "ggplot", "aes", "geom_bar", "xlab", "ylab", "ggtitle", "theme", "element_blank", "element_text", "ID", "Yes", "No", "Tree"
)
)
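A sketch of how the plot can support the trade-off discussed above, assuming the agaricus demo data; the two models differ only in `min_child_weight`, so their deepness distributions can be compared side by side.
```r
library(xgboost)
data(agaricus.train, package = 'xgboost')
bst.free <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
                    max.depth = 15, eta = 1, nthread = 2, nrounds = 30,
                    objective = "binary:logistic")
bst.reg  <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
                    max.depth = 15, eta = 1, nthread = 2, nrounds = 30,
                    objective = "binary:logistic", min_child_weight = 50)
# compare the leaf count and weighted cover distributions of the two models
xgb.plot.deepness(model = bst.free)
xgb.plot.deepness(model = bst.reg)
```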

View File

@@ -1,6 +1,6 @@
#' Plot feature importance bar graph
#'
#' Read a data.table containing feature importance details and plot it.
#' Read a data.table containing feature importance details and plot it (for both GLM and Trees).
#'
#' @importFrom magrittr %>%
#' @param importance_matrix a \code{data.table} returned by the \code{xgb.importance} function.
@@ -10,7 +10,7 @@
#'
#' @details
#' The purpose of this function is to easily represent the importance of each feature of a model.
#' The function return a ggplot graph, therefore each of its characteristic can be overriden (to customize it).
#' The function returns a ggplot graph, therefore each of its characteristics can be overridden (to customize it).
#' In particular you may want to override the title of the graph. To do so, add \code{+ ggtitle("A GRAPH NAME")} next to the value returned by this function.
#'
#' @examples
@@ -19,39 +19,61 @@
#' #Both dataset are list with two items, a sparse matrix and labels
#' #(labels = outcome column which will be learned).
#' #Each column of the sparse Matrix is a feature in one hot encoding format.
#' train <- agaricus.train
#'
#' bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
#' bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 2,
#' eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
#'
#' #train$data@@Dimnames[[2]] represents the column names of the sparse matrix.
#' importance_matrix <- xgb.importance(train$data@@Dimnames[[2]], model = bst)
#' #agaricus.train$data@@Dimnames[[2]] represents the column names of the sparse matrix.
#' importance_matrix <- xgb.importance(agaricus.train$data@@Dimnames[[2]], model = bst)
#' xgb.plot.importance(importance_matrix)
#'
#' @export
xgb.plot.importance <- function(importance_matrix = NULL, numberOfClusters = c(1:10)){
if (!"data.table" %in% class(importance_matrix)) {
stop("importance_matrix: Should be a data.table.")
xgb.plot.importance <-
function(importance_matrix = NULL, numberOfClusters = c(1:10)) {
if (!"data.table" %in% class(importance_matrix)) {
stop("importance_matrix: Should be a data.table.")
}
if (!requireNamespace("ggplot2", quietly = TRUE)) {
stop("ggplot2 package is required for plotting the importance", call. = FALSE)
}
if (!requireNamespace("Ckmeans.1d.dp", quietly = TRUE)) {
stop("Ckmeans.1d.dp package is required for plotting the importance", call. = FALSE)
}
if(isTRUE(all.equal(colnames(importance_matrix), c("Feature", "Gain", "Cover", "Frequency")))){
y.axe.name <- "Gain"
} else if(isTRUE(all.equal(colnames(importance_matrix), c("Feature", "Weight")))){
y.axe.name <- "Weight"
} else {
stop("Importance matrix is not correct (column names issue)")
}
# To avoid issues in clustering when co-occurrences are used
importance_matrix <-
importance_matrix[, .(Gain.or.Weight = sum(get(y.axe.name))), by = Feature]
clusters <-
suppressWarnings(Ckmeans.1d.dp::Ckmeans.1d.dp(importance_matrix[,Gain.or.Weight], numberOfClusters))
importance_matrix[,"Cluster":= clusters$cluster %>% as.character]
plot <-
ggplot2::ggplot(
importance_matrix, ggplot2::aes(
x = stats::reorder(Feature, Gain.or.Weight), y = Gain.or.Weight, width = 0.05
), environment = environment()
) + ggplot2::geom_bar(ggplot2::aes(fill = Cluster), stat = "identity", position =
"identity") + ggplot2::coord_flip() + ggplot2::xlab("Features") + ggplot2::ylab(y.axe.name) + ggplot2::ggtitle("Feature importance") + ggplot2::theme(
plot.title = ggplot2::element_text(lineheight = .9, face = "bold"), panel.grid.major.y = ggplot2::element_blank()
)
return(plot)
}
if (!require(ggplot2, quietly = TRUE)) {
stop("ggplot2 package is required for plotting the importance", call. = FALSE)
}
if (!requireNamespace("Ckmeans.1d.dp", quietly = TRUE)) {
stop("Ckmeans.1d.dp package is required for plotting the importance", call. = FALSE)
}
# To avoid issues in clustering when co-occurences are used
importance_matrix <- importance_matrix[, .(Gain = sum(Gain)), by = Feature]
clusters <- suppressWarnings(Ckmeans.1d.dp::Ckmeans.1d.dp(importance_matrix[,Gain], numberOfClusters))
importance_matrix[,"Cluster":=clusters$cluster %>% as.character]
plot <- ggplot(importance_matrix, aes(x=reorder(Feature, Gain), y = Gain, width= 0.05), environment = environment())+ geom_bar(aes(fill=Cluster), stat="identity", position="identity") + coord_flip() + xlab("Features") + ylab("Gain") + ggtitle("Feature importance") + theme(plot.title = element_text(lineheight=.9, face="bold"), panel.grid.major.y = element_blank() )
return(plot)
}
# Avoid error messages during CRAN check.
# The reason is that these variables are never declared
# They are mainly column names inferred by Data.table...
globalVariables(c("Feature", "Gain", "Cluster", "ggplot", "aes", "geom_bar", "coord_flip", "xlab", "ylab", "ggtitle", "theme", "element_blank", "element_text"))
globalVariables(
c(
"Feature", "Gain.or.Weight", "Cluster", "ggplot", "aes", "geom_bar", "coord_flip", "xlab", "ylab", "ggtitle", "theme", "element_blank", "element_text", "Gain.or.Weight"
)
)
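A sketch of the title override mentioned in the details above, assuming `bst` and the agaricus demo data from the roxygen example.
```r
importance_matrix <- xgb.importance(agaricus.train$data@Dimnames[[2]], model = bst)
# the returned value is a ggplot object, so its characteristics can be overridden
xgb.plot.importance(importance_matrix) + ggplot2::ggtitle("Agaricus feature importance")
```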

View File

@@ -0,0 +1,114 @@
#' Project all trees on one tree and plot it
#'
#' Visualization of the ensemble of trees as a single collective unit.
#'
#' @importFrom data.table data.table
#' @importFrom data.table rbindlist
#' @importFrom data.table setnames
#' @importFrom data.table :=
#' @importFrom magrittr %>%
#' @importFrom stringr str_detect
#' @importFrom stringr str_extract
#'
#' @param model object created by the \code{xgb.train} function.
#' @param feature_names names of each feature as a \code{character} vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.
#' @param features.keep number of features to keep in each position of the multi trees.
#' @param plot.width width in pixels of the graph to produce
#' @param plot.height height in pixels of the graph to produce
#'
#' @return A single graph projecting all the trees of the model onto one tree, rendered with \code{DiagrammeR}.
#'
#' @details
#'
#' This function tries to capture the complexity of a gradient boosted tree ensemble
#' in a cohesive way.
#'
#' The goal is to improve the interpretability of a model generally seen as a black box.
#' The function is dedicated to boosting applied to decision trees only.
#'
#' The purpose is to move from an ensemble of trees to a single tree only.
#'
#' It takes advantage of the fact that the shape of a binary tree is only defined by
#' its deepness (therefore in a boosting model, all trees have the same shape).
#'
#' Moreover, the trees tend to reuse the same features.
#'
#' The function will project each tree onto one, and keep for each position the
#' \code{features.keep} first features (based on the Gain per feature measure).
#'
#' This function is inspired by this blog post:
#' \url{https://wellecks.wordpress.com/2015/02/21/peering-into-the-black-box-visualizing-lambdamart/}
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#'
#' bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 15,
#' eta = 1, nthread = 2, nround = 30, objective = "binary:logistic",
#' min_child_weight = 50)
#'
#' p <- xgb.plot.multi.trees(model = bst, feature_names = agaricus.train$data@@Dimnames[[2]], features.keep = 3)
#' print(p)
#'
#' @export
xgb.plot.multi.trees <- function(model, feature_names = NULL, features.keep = 5, plot.width = NULL, plot.height = NULL){
tree.matrix <- xgb.model.dt.tree(feature_names = feature_names, model = model)
# first number of the path represents the tree, then the following numbers are related to the path to follow
# root init
root.nodes <- tree.matrix[str_detect(ID, "\\d+-0"), ID]
tree.matrix[ID %in% root.nodes, abs.node.position:=root.nodes]
precedent.nodes <- root.nodes
while(tree.matrix[,sum(is.na(abs.node.position))] > 0) {
yes.row.nodes <- tree.matrix[abs.node.position %in% precedent.nodes & !is.na(Yes)]
no.row.nodes <- tree.matrix[abs.node.position %in% precedent.nodes & !is.na(No)]
yes.nodes.abs.pos <- yes.row.nodes[, abs.node.position] %>% paste0("_0")
no.nodes.abs.pos <- no.row.nodes[, abs.node.position] %>% paste0("_1")
tree.matrix[ID %in% yes.row.nodes[, Yes], abs.node.position := yes.nodes.abs.pos]
tree.matrix[ID %in% no.row.nodes[, No], abs.node.position := no.nodes.abs.pos]
precedent.nodes <- c(yes.nodes.abs.pos, no.nodes.abs.pos)
}
tree.matrix[!is.na(Yes),Yes:= paste0(abs.node.position, "_0")]
tree.matrix[!is.na(No),No:= paste0(abs.node.position, "_1")]
remove.tree <- . %>% str_replace(pattern = "^\\d+-", replacement = "")
tree.matrix[,`:=`(abs.node.position=remove.tree(abs.node.position), Yes=remove.tree(Yes), No=remove.tree(No))]
nodes.dt <- tree.matrix[,.(Quality = sum(Quality)),by = .(abs.node.position, Feature)][,.(Text =paste0(Feature[1:min(length(Feature), features.keep)], " (", Quality[1:min(length(Quality), features.keep)], ")") %>% paste0(collapse = "\n")), by=abs.node.position]
edges.dt <- tree.matrix[Feature != "Leaf",.(abs.node.position, Yes)] %>% list(tree.matrix[Feature != "Leaf",.(abs.node.position, No)]) %>% rbindlist() %>% setnames(c("From", "To")) %>% .[,.N,.(From, To)] %>% .[,N:=NULL]
nodes <- DiagrammeR::create_nodes(nodes = nodes.dt[,abs.node.position],
label = nodes.dt[,Text],
style = "filled",
color = "DimGray",
fillcolor= "Beige",
shape = "oval",
fontname = "Helvetica"
)
edges <- DiagrammeR::create_edges(from = edges.dt[,From],
to = edges.dt[,To],
color = "DimGray",
arrowsize = "1.5",
arrowhead = "vee",
fontname = "Helvetica",
rel = "leading_to")
graph <- DiagrammeR::create_graph(nodes_df = nodes,
edges_df = edges,
graph_attrs = "rankdir = LR")
DiagrammeR::render_graph(graph, width = plot.width, height = plot.height)
}
globalVariables(
c(
"Feature", "no.nodes.abs.pos", "ID", "Yes", "No", "Tree", "yes.nodes.abs.pos", "abs.node.position"
)
)

View File

@@ -1,27 +1,15 @@
#' Plot a boosted tree model
#'
#' Read a tree model text dump.
#' Plotting only works for boosted tree model (not linear model).
#' Read a tree model text dump and plot the model.
#'
#' @importFrom data.table data.table
#' @importFrom data.table set
#' @importFrom data.table rbindlist
#' @importFrom data.table :=
#' @importFrom data.table copy
#' @importFrom magrittr %>%
#' @importFrom magrittr not
#' @importFrom magrittr add
#' @importFrom stringr str_extract
#' @importFrom stringr str_split
#' @importFrom stringr str_extract
#' @importFrom stringr str_trim
#' @param feature_names names of each feature as a character vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.
#' @param filename_dump the path to the text file storing the model. Model dump must include the gain per feature and per tree (parameter \code{with.stats = T} in function \code{xgb.dump}). Possible to provide a model directly (see \code{model} argument).
#' @param feature_names names of each feature as a \code{character} vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.
#' @param model generated by the \code{xgb.train} function. Avoid the creation of a dump file.
#' @param n_first_tree limit the plot to the n first trees. If \code{NULL}, all trees of the model are plotted. Performance can be low for huge models.
#' @param CSSstyle a \code{character} vector storing a css style to customize the appearance of nodes. Look at the \href{https://github.com/knsv/mermaid/wiki}{Mermaid wiki} for more information.
#' @param width the width of the diagram in pixels.
#' @param height the height of the diagram in pixels.
#' @param plot.width the width of the diagram in pixels.
#' @param plot.height the height of the diagram in pixels.
#'
#' @return A \code{DiagrammeR} of the model.
#'
@@ -30,37 +18,26 @@
#' The content of each node is organised that way:
#'
#' \itemize{
#' \item \code{feature} value ;
#' \item \code{cover}: the sum of second order gradient of training data classified to the leaf, if it is square loss, this simply corresponds to the number of instances in that branch. Deeper in the tree a node is, lower this metric will be ;
#' \item \code{feature} value;
#' \item \code{cover}: the sum of the second order gradient of the training data classified to the leaf; if it is square loss, this simply corresponds to the number of instances in that branch. The deeper in the tree a node is, the lower this metric will be;
#' \item \code{gain}: metric of the importance of the node in the model.
#' }
#'
#' Each branch finishes with a leaf. For each leaf, only the \code{cover} is indicated.
#' It uses \href{https://github.com/knsv/mermaid/}{Mermaid} library for that purpose.
#' The function uses \href{http://www.graphviz.org/}{GraphViz} library for that purpose.
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#'
#' #Both dataset are list with two items, a sparse matrix and labels
#' #(labels = outcome column which will be learned).
#' #Each column of the sparse Matrix is a feature in one hot encoding format.
#' train <- agaricus.train
#'
#' bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
#' bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 2,
#' eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
#'
#' #agaricus.test$data@@Dimnames[[2]] represents the column names of the sparse matrix.
#' xgb.plot.tree(agaricus.train$data@@Dimnames[[2]], model = bst)
#' # agaricus.train$data@@Dimnames[[2]] represents the column names of the sparse matrix.
#' xgb.plot.tree(feature_names = agaricus.train$data@@Dimnames[[2]], model = bst)
#'
#' @export
#'
xgb.plot.tree <- function(feature_names = NULL, filename_dump = NULL, model = NULL, n_first_tree = NULL, CSSstyle = NULL, width = NULL, height = NULL){
xgb.plot.tree <- function(feature_names = NULL, model = NULL, n_first_tree = NULL, plot.width = NULL, plot.height = NULL){
if (!(class(CSSstyle) %in% c("character", "NULL") && length(CSSstyle) <= 1)) {
stop("style: Has to be a character vector of size 1.")
}
if (!class(model) %in% c("xgb.Booster", "NULL")) {
if (class(model) != "xgb.Booster") {
stop("model: Has to be an object of class xgb.Booster model generaged by the xgb.train function.")
}
@@ -68,30 +45,40 @@ xgb.plot.tree <- function(feature_names = NULL, filename_dump = NULL, model = NU
stop("DiagrammeR package is required for xgb.plot.tree", call. = FALSE)
}
if(is.null(model)){
allTrees <- xgb.model.dt.tree(feature_names = feature_names, filename_dump = filename_dump, n_first_tree = n_first_tree)
} else {
allTrees <- xgb.model.dt.tree(feature_names = feature_names, model = model, n_first_tree = n_first_tree)
}
allTrees <- xgb.model.dt.tree(feature_names = feature_names, model = model, n_first_tree = n_first_tree)
allTrees[Feature!="Leaf" ,yesPath:= paste(ID,"(", Feature, "<br/>Cover: ", Cover, "<br/>Gain: ", Quality, ")-->|< ", Split, "|", Yes, ">", Yes.Feature, "]", sep = "")]
allTrees[, label:= paste0(Feature, "\nCover: ", Cover, "\nGain: ", Quality)]
allTrees[, shape:= "rectangle"][Feature == "Leaf", shape:= "oval"]
allTrees[, filledcolor:= "Beige"][Feature == "Leaf", filledcolor:= "Khaki"]
allTrees[Feature!="Leaf" ,noPath:= paste(ID,"(", Feature, ")-->|>= ", Split, "|", No, ">", No.Feature, "]", sep = "")]
# rev is used to put the first tree on top.
nodes <- DiagrammeR::create_nodes(nodes = allTrees[,ID] %>% rev,
label = allTrees[,label] %>% rev,
style = "filled",
color = "DimGray",
fillcolor= allTrees[,filledcolor] %>% rev,
shape = allTrees[,shape] %>% rev,
data = allTrees[,Feature] %>% rev,
fontname = "Helvetica"
)
edges <- DiagrammeR::create_edges(from = allTrees[Feature != "Leaf", c(ID)] %>% rep(2),
to = allTrees[Feature != "Leaf", c(Yes, No)],
label = allTrees[Feature != "Leaf", paste("<",Split)] %>% c(rep("",nrow(allTrees[Feature != "Leaf"]))),
color = "DimGray",
arrowsize = "1.5",
arrowhead = "vee",
fontname = "Helvetica",
rel = "leading_to")
if(is.null(CSSstyle)){
CSSstyle <- "classDef greenNode fill:#A2EB86, stroke:#04C4AB, stroke-width:2px;classDef redNode fill:#FFA070, stroke:#FF5E5E, stroke-width:2px"
}
graph <- DiagrammeR::create_graph(nodes_df = nodes,
edges_df = edges,
graph_attrs = "rankdir = LR")
yes <- allTrees[Feature!="Leaf", c(Yes)] %>% paste(collapse = ",") %>% paste("class ", ., " greenNode", sep = "")
no <- allTrees[Feature!="Leaf", c(No)] %>% paste(collapse = ",") %>% paste("class ", ., " redNode", sep = "")
path <- allTrees[Feature!="Leaf", c(yesPath, noPath)] %>% .[order(.)] %>% paste(sep = "", collapse = ";") %>% paste("graph LR", .,collapse = "", sep = ";") %>% paste(CSSstyle, yes, no, sep = ";")
DiagrammeR::mermaid(path, width, height)
DiagrammeR::render_graph(graph, width = plot.width, height = plot.height)
}
# Avoid error messages during CRAN check.
# The reason is that these variables are never declared
# They are mainly column names inferred by Data.table...
globalVariables(c("Feature", "yesPath", "ID", "Cover", "Quality", "Split", "Yes", "Yes.Feature", "noPath", "No", "No.Feature", "."))
globalVariables(c("Feature", "ID", "Cover", "Quality", "Split", "Yes", "No", ".", "shape", "filledcolor", "label"))

View File

@@ -16,7 +16,6 @@
#' bst <- xgb.load('xgb.model')
#' pred <- predict(bst, test$data)
#' @export
#'
xgb.save <- function(model, fname) {
if (typeof(fname) != "character") {
stop("xgb.save: fname must be character")

View File

@@ -16,7 +16,6 @@
#' bst <- xgb.load(raw)
#' pred <- predict(bst, test$data)
#' @export
#'
xgb.save.raw <- function(model) {
if (class(model) == "xgb.Booster"){
model <- model$handle

View File

@@ -19,7 +19,7 @@
#' \item \code{eta} controls the learning rate: scale the contribution of each tree by a factor of \code{0 < eta < 1} when it is added to the current approximation. Used to prevent overfitting by making the boosting process more conservative. A lower value for \code{eta} implies a larger value for \code{nrounds}: a low \code{eta} value means a model more robust to overfitting but slower to compute. Default: 0.3
#' \item \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
#' \item \code{max_depth} maximum depth of a tree. Default: 6
#' \item \code{min_child_weight} minimum sum of instance weight(hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
#' \item \code{min_child_weight} minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
#' \item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collects half of the data instances to grow trees and this will prevent overfitting. It makes computation shorter (because there is less data to analyse). It is advised to use this parameter with \code{eta} and to increase \code{nround}. Default: 1
#' \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
#' \item \code{num_parallel_tree} Experimental parameter. number of trees to grow per round. Useful to test Random Forest through Xgboost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{round = 1} accordingly). Default: 1
@@ -36,19 +36,19 @@
#' 3. Task Parameters
#'
#' \itemize{
#' \item \code{objective} specify the learning task and the corresponding learning objective, and the objective options are below:
#' \item \code{objective} specify the learning task and the corresponding learning objective, users can pass a self-defined function to it. The default objective options are below:
#' \itemize{
#' \item \code{reg:linear} linear regression (Default).
#' \item \code{reg:logistic} logistic regression.
#' \item \code{binary:logistic} logistic regression for binary classification. Output probability.
#' \item \code{binary:logitraw} logistic regression for binary classification, output score before logistic transformation.
#' \item \code{num_class} set the number of classes. To use only with multiclass objectives.
#' \item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective. Class is represented by a number and should be from 0 to \code{tonum_class}.
#' \item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective. Class is represented by a number and should be from 0 to \code{num_class}.
#' \item \code{multi:softprob} same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata, nclass matrix. The result contains predicted probabilities of each data point belonging to each class.
#' \item \code{rank:pairwise} set xgboost to do ranking task by minimizing the pairwise loss.
#' }
#' \item \code{base_score} the initial prediction score of all instances, global bias. Default: 0.5
#' \item \code{eval_metric} evaluation metrics for validation data. Default: metric will be assigned according to objective(rmse for regression, and error for classification, mean average precision for ranking). List is provided in detail section.
#' \item \code{eval_metric} evaluation metrics for validation data. Users can pass a self-defined function to it. Default: metric will be assigned according to objective (rmse for regression, and error for classification, mean average precision for ranking). The list is provided in the Details section.
#' }
#'
#' @param data takes an \code{xgb.DMatrix} as the input.
@@ -66,13 +66,14 @@
#' prediction and dtrain,
#' @param verbose If 0, xgboost will stay silent. If 1, xgboost will print
#' information of performance. If 2, xgboost will print information of both
#' @param printEveryN Print every N progress messages when \code{verbose>0}. Default is 1 which means all messages are printed.
#' @param early_stop_round If \code{NULL}, the early stopping function is not triggered.
#' @param print.every.n Print every N progress messages when \code{verbose>0}. Default is 1 which means all messages are printed.
#' @param early.stop.round If \code{NULL}, the early stopping function is not triggered.
#' If set to an integer \code{k}, training with a validation set will stop if the performance
#' keeps getting worse consecutively for \code{k} rounds.
#' @param early.stop.round An alternative of \code{early_stop_round}.
#' @param maximize If \code{feval} and \code{early_stop_round} are set, then \code{maximize} must be set as well.
#' @param maximize If \code{feval} and \code{early.stop.round} are set, then \code{maximize} must be set as well.
#' \code{maximize=TRUE} means the larger the evaluation score the better.
#' @param save_period save the model to the disk in every \code{save_period} rounds, 0 means no such action.
#' @param save_name the name or path for periodically saved model file.
#' @param ... other parameters to pass to \code{params}.
#'
#' @details
@@ -88,6 +89,7 @@
#' \itemize{
#' \item \code{rmse} root mean square error. \url{http://en.wikipedia.org/wiki/Root_mean_square_error}
#' \item \code{logloss} negative log-likelihood. \url{http://en.wikipedia.org/wiki/Log-likelihood}
#' \item \code{mlogloss} multiclass logloss. \url{https://www.kaggle.com/wiki/MultiClassLogLoss}
#' \item \code{error} Binary classification error rate. It is calculated as \code{(wrong cases) / (all cases)}. For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
#' \item \code{merror} Multiclass classification error rate. It is calculated as \code{(wrong cases) / (all cases)}.
#' \item \code{auc} Area under the curve. \url{http://en.wikipedia.org/wiki/Receiver_operating_characteristic#'Area_under_curve} for ranking evaluation.
@@ -103,7 +105,6 @@
#' dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
#' dtest <- dtrain
#' watchlist <- list(eval = dtest, train = dtrain)
#' param <- list(max.depth = 2, eta = 1, silent = 1)
#' logregobj <- function(preds, dtrain) {
#' labels <- getinfo(dtrain, "label")
#' preds <- 1/(1 + exp(-preds))
@@ -116,13 +117,13 @@
#' err <- as.numeric(sum(labels != (preds > 0)))/length(labels)
#' return(list(metric = "error", value = err))
#' }
#' bst <- xgb.train(param, dtrain, nthread = 2, nround = 2, watchlist, logregobj, evalerror)
#' param <- list(max.depth = 2, eta = 1, silent = 1, objective=logregobj,eval_metric=evalerror)
#' bst <- xgb.train(param, dtrain, nthread = 2, nround = 2, watchlist)
#' @export
#'
xgb.train <- function(params=list(), data, nrounds, watchlist = list(),
obj = NULL, feval = NULL, verbose = 1, printEveryN=1L,
early_stop_round = NULL, early.stop.round = NULL,
maximize = NULL, ...) {
obj = NULL, feval = NULL, verbose = 1, print.every.n=1L,
early.stop.round = NULL, maximize = NULL,
save_period = 0, save_name = "xgboost.model", ...) {
dtrain <- data
if (typeof(params) != "list") {
stop("xgb.train: first argument params must be list")
@@ -137,14 +138,34 @@ xgb.train <- function(params=list(), data, nrounds, watchlist = list(),
}
if (length(watchlist) != 0 && verbose == 0) {
warning('watchlist is provided but verbose=0, no evaluation information will be printed')
watchlist <- list()
}
params = append(params, list(...))
fit.call <- match.call()
dot.params <- list(...)
nms.params <- names(params)
nms.dot.params <- names(dot.params)
if (length(intersect(nms.params,nms.dot.params)) > 0)
stop("Duplicated term in parameters. Please check your list of params.")
params <- append(params, dot.params)
# customized objective and evaluation metric interface
if (!is.null(params$objective) && !is.null(obj))
stop("xgb.train: cannot assign two different objectives")
if (!is.null(params$objective))
if (class(params$objective) == 'function') {
obj <- params$objective
params$objective <- NULL
}
if (!is.null(params$eval_metric) && !is.null(feval))
stop("xgb.train: cannot assign two different evaluation metrics")
if (!is.null(params$eval_metric))
if (class(params$eval_metric) == 'function') {
feval <- params$eval_metric
params$eval_metric <- NULL
}
# Early stopping
if (is.null(early_stop_round) && !is.null(early.stop.round))
early_stop_round = early.stop.round
if (!is.null(early_stop_round)){
if (!is.null(early.stop.round)){
if (!is.null(feval) && is.null(maximize))
stop('Please set maximize to note whether the model is maximizing the evaluation or not.')
if (length(watchlist) == 0)
@@ -154,55 +175,63 @@ xgb.train <- function(params=list(), data, nrounds, watchlist = list(),
if (is.null(maximize))
{
if (params$eval_metric %in% c('rmse','logloss','error','merror','mlogloss')) {
maximize = FALSE
maximize <- FALSE
} else {
maximize = TRUE
maximize <- TRUE
}
}
if (maximize) {
bestScore = 0
bestScore <- 0
} else {
bestScore = Inf
bestScore <- Inf
}
bestInd = 0
bestInd <- 0
earlyStopflag = FALSE
if (length(watchlist)>1)
if (length(watchlist) > 1)
warning('Only the first data set in watchlist is used for early stopping process.')
}
handle <- xgb.Booster(params, append(watchlist, dtrain))
bst <- xgb.handleToBooster(handle)
printEveryN=max( as.integer(printEveryN), 1L)
print.every.n <- max( as.integer(print.every.n), 1L)
for (i in 1:nrounds) {
succ <- xgb.iter.update(bst$handle, dtrain, i - 1, obj)
if (length(watchlist) != 0) {
msg <- xgb.iter.eval(bst$handle, watchlist, i - 1, feval)
if (0== ( (i-1) %% printEveryN))
cat(paste(msg, "\n", sep=""))
if (!is.null(early_stop_round))
if (0 == ( (i - 1) %% print.every.n))
cat(paste(msg, "\n", sep = ""))
if (!is.null(early.stop.round))
{
score = strsplit(msg,':|\\s+')[[1]][3]
score = as.numeric(score)
if ((maximize && score>bestScore) || (!maximize && score<bestScore)) {
bestScore = score
bestInd = i
score <- strsplit(msg,':|\\s+')[[1]][3]
score <- as.numeric(score)
if ( (maximize && score > bestScore) || (!maximize && score < bestScore)) {
bestScore <- score
bestInd <- i
} else {
if (i-bestInd>=early_stop_round) {
earlyStopflag = TRUE
cat('Stopping. Best iteration:',bestInd)
earlyStopflag = TRUE
if (i - bestInd >= early.stop.round) {
cat('Stopping. Best iteration:', bestInd, '\n')
break
}
}
}
}
if (save_period > 0) {
if (i %% save_period == 0) {
xgb.save(bst, save_name)
}
}
}
bst <- xgb.Booster.check(bst)
if (!is.null(early_stop_round)) {
bst$bestScore = bestScore
bst$bestInd = bestInd
if (!is.null(early.stop.round)) {
bst$bestScore <- bestScore
bst$bestInd <- bestInd
}
attr(bst, "call") <- fit.call
attr(bst, "params") <- params
return(bst)
}
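A sketch of the periodic-saving branch added in the loop above, assuming `dtrain` and `watchlist` from the roxygen example; `xgboost_checkpoint.model` is only an illustrative file name.
```r
# save a checkpoint of the booster every 10 rounds, as the save_period branch above does
bst <- xgb.train(params = list(max.depth = 2, eta = 1, objective = "binary:logistic"),
                 data = dtrain, nrounds = 30, watchlist = watchlist,
                 save_period = 10, save_name = "xgboost_checkpoint.model")
```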

View File

@@ -28,15 +28,17 @@
#' @param verbose If 0, xgboost will stay silent. If 1, xgboost will print
#' information of performance. If 2, xgboost will print information of both
#' performance and construction progress information
#' @param printEveryN Print every N progress messages when \code{verbose>0}. Default is 1 which means all messages are printed.
#' @param print.every.n Print every N progress messages when \code{verbose>0}. Default is 1 which means all messages are printed.
#' @param missing Missing is only used when the input is a dense matrix; pick a float
#' value that represents the missing value. Sometimes a dataset uses 0 or another extreme value to represent missing values.
#' @param early_stop_round If \code{NULL}, the early stopping function is not triggered.
#' @param weight a vector indicating the weight for each row of the input.
#' @param early.stop.round If \code{NULL}, the early stopping function is not triggered.
#' If set to an integer \code{k}, training with a validation set will stop if the performance
#' keeps getting worse consecutively for \code{k} rounds.
#' @param early.stop.round An alternative of \code{early_stop_round}.
#' @param maximize If \code{feval} and \code{early_stop_round} are set, then \code{maximize} must be set as well.
#' @param maximize If \code{feval} and \code{early.stop.round} are set, then \code{maximize} must be set as well.
#' \code{maximize=TRUE} means the larger the evaluation score the better.
#' @param save_period save the model to the disk in every \code{save_period} rounds, 0 means no such action.
#' @param save_name the name or path for periodically saved model file.
#' @param ... other parameters to pass to \code{params}.
#'
#' @details
@@ -56,15 +58,11 @@
#' pred <- predict(bst, test$data)
#'
#' @export
#'
xgboost <- function(data = NULL, label = NULL, missing = NULL, params = list(), nrounds,
verbose = 1, printEveryN=1L, early_stop_round = NULL, early.stop.round = NULL,
maximize = NULL, ...) {
if (is.null(missing)) {
dtrain <- xgb.get.DMatrix(data, label)
} else {
dtrain <- xgb.get.DMatrix(data, label, missing)
}
xgboost <- function(data = NULL, label = NULL, missing = NA, weight = NULL,
params = list(), nrounds,
verbose = 1, print.every.n = 1L, early.stop.round = NULL,
maximize = NULL, save_period = 0, save_name = "xgboost.model", ...) {
dtrain <- xgb.get.DMatrix(data, label, missing, weight)
params <- append(params, list(...))
@@ -74,14 +72,12 @@ xgboost <- function(data = NULL, label = NULL, missing = NULL, params = list(),
watchlist <- list()
}
bst <- xgb.train(params, dtrain, nrounds, watchlist, verbose = verbose, printEveryN=printEveryN,
early_stop_round = early_stop_round,
early.stop.round = early.stop.round)
bst <- xgb.train(params, dtrain, nrounds, watchlist, verbose = verbose, print.every.n=print.every.n,
early.stop.round = early.stop.round, maximize = maximize,
save_period = save_period, save_name = save_name)
return(bst)
}
#' Training part from Mushroom Data Set
#'
#' This data set is originally from the Mushroom data set,

View File
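For reference, here is a minimal sketch of how the renamed and newly added arguments documented above (print.every.n, early.stop.round, maximize, save_period, save_name) could be used; it is illustrative only, assuming the agaricus data that ships with the package.

```r
library(xgboost)
data(agaricus.train, package = 'xgboost')
data(agaricus.test,  package = 'xgboost')

# Simple interface: periodic model saving and reduced logging.
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               max.depth = 2, eta = 1, nrounds = 10, nthread = 2,
               objective = "binary:logistic",
               print.every.n = 5,            # print every 5th evaluation message
               save_period = 5,              # write the model to disk every 5 rounds
               save_name = "xgboost.model")

# Early stopping is meant for a validation set, so use xgb.train with a watchlist.
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest  <- xgb.DMatrix(agaricus.test$data,  label = agaricus.test$label)
bst <- xgb.train(params = list(max.depth = 2, eta = 1, silent = 1,
                               objective = "binary:logistic"),
                 data = dtrain, nrounds = 50,
                 watchlist = list(eval = dtest, train = dtrain),
                 early.stop.round = 3,       # stop after 3 rounds without improvement
                 maximize = FALSE)           # the default error metric is minimized
bst$bestScore                                # filled in when early stopping is active
bst$bestInd
```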

@@ -1,20 +1,44 @@
# R package for xgboost.
R package for xgboost
=====================
## Installation
[![CRAN Status Badge](http://www.r-pkg.org/badges/version/xgboost)](http://cran.r-project.org/web/packages/xgboost)
[![CRAN Downloads](http://cranlogs.r-pkg.org/badges/xgboost)](http://cran.rstudio.com/web/packages/xgboost/index.html)
For up-to-date version (which is recommended), please install from github. Windows user will need to install [RTools](http://cran.r-project.org/bin/windows/Rtools/) first.
Installation
------------
```r
devtools::install_github('dmlc/xgboost',subdir='R-package')
```
For stable version on CRAN, please run
We are [on CRAN](https://cran.r-project.org/web/packages/xgboost/index.html) now. For the stable, pre-compiled version (for Windows and OS X), please install from CRAN:
```r
install.packages('xgboost')
```
## Examples
For the up-to-date version, please install from GitHub. Windows users will need to install [RTools](http://cran.r-project.org/bin/windows/Rtools/) first.
```r
devtools::install_github('dmlc/xgboost',subdir='R-package')
```
Examples
--------
* Please visit [walk through example](demo).
* See also the [example scripts](../demo/kaggle-higgs) for Kaggle Higgs Challenge, including [speedtest script](../demo/kaggle-higgs/speedtest.R) on this dataset and the one related to [Otto challenge](../demo/kaggle-otto), including a [RMarkdown documentation](../demo/kaggle-otto/understandingXGBoostModel.Rmd).
Notes
-----
If you face an issue installing the package using ```devtools::install_github```, something like this (even after updating libxml and RCurl as many forums suggest) -
```
devtools::install_github('dmlc/xgboost',subdir='R-package')
Downloading github repo dmlc/xgboost@master
Error in function (type, msg, asError = TRUE) :
Peer certificate cannot be authenticated with given CA certificates
```
To get around this you can build the package locally as mentioned [here](https://github.com/dmlc/xgboost/issues/347) -
```
1. Clone the current repository and set your workspace to xgboost/R-package/
2. Run R CMD INSTALL --build . in terminal to get the tarball.
3. Run install.packages('path_to_the_tarball',repo=NULL) in R to install.
```

View File

@@ -1,4 +1,5 @@
basic_walkthrough Basic feature walkthrough
caret_wrapper Use xgboost to train in caret library
custom_objective Customize loss function, and evaluation metric
boost_from_prediction Boosting from existing prediction
predict_first_ntree Predicting using first n trees

View File

@@ -1,6 +1,7 @@
XGBoost R Feature Walkthrough
====
* [Basic walkthrough of wrappers](basic_walkthrough.R)
* [Train a xgboost model from caret library](caret_wrapper.R)
* [Customize loss function, and evaluation metric](custom_objective.R)
* [Boosting from existing prediction](boost_from_prediction.R)
* [Predicting using first n trees](predict_first_ntree.R)

View File

@@ -1,7 +1,7 @@
require(xgboost)
require(methods)
# we load in the agaricus dataset
# In this example, we are aiming to predict whether a mushroom can be eated
# In this example, we are aiming to predict whether a mushroom can be eaten
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
@@ -12,30 +12,30 @@ class(train$data)
#-------------Basic Training using XGBoost-----------------
# this is the basic usage of xgboost you can put matrix in data field
# note: we are puting in sparse matrix here, xgboost naturally handles sparse input
# use sparse matrix when your feature is sparse(e.g. when you using one-hot encoding vector)
print("training xgboost with sparseMatrix")
# note: we are putting in sparse matrix here, xgboost naturally handles sparse input
# use sparse matrix when your features are sparse (e.g. when you are using a one-hot encoded vector)
print("Training xgboost with sparseMatrix")
bst <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nround = 2,
nthread = 2, objective = "binary:logistic")
# alternatively, you can put in dense matrix, i.e. basic R-matrix
print("training xgboost with Matrix")
print("Training xgboost with Matrix")
bst <- xgboost(data = as.matrix(train$data), label = train$label, max.depth = 2, eta = 1, nround = 2,
nthread = 2, objective = "binary:logistic")
# you can also put in xgb.DMatrix object, stores label, data and other meta datas needed for advanced features
print("training xgboost with xgb.DMatrix")
# you can also put in an xgb.DMatrix object, which stores label, data and other metadata needed for advanced features
print("Training xgboost with xgb.DMatrix")
dtrain <- xgb.DMatrix(data = train$data, label = train$label)
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, nthread = 2,
objective = "binary:logistic")
# Verbose = 0,1,2
print ('train xgboost with verbose 0, no message')
print("Train xgboost with verbose 0, no message")
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2,
nthread = 2, objective = "binary:logistic", verbose = 0)
print ('train xgboost with verbose 1, print evaluation metric')
print("Train xgboost with verbose 1, print evaluation metric")
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2,
nthread = 2, objective = "binary:logistic", verbose = 1)
print ('train xgboost with verbose 2, also print information about tree')
print("Train xgboost with verbose 2, also print information about tree")
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2,
nthread = 2, objective = "binary:logistic", verbose = 2)
@@ -72,15 +72,15 @@ print(paste("sum(abs(pred3-pred))=", sum(abs(pred2-pred))))
dtrain <- xgb.DMatrix(data = train$data, label=train$label)
dtest <- xgb.DMatrix(data = test$data, label=test$label)
#---------------Using watchlist----------------
# watchlist is a list of xgb.DMatrix, each of them tagged with name
# watchlist is a list of xgb.DMatrix, each of them is tagged with name
watchlist <- list(train=dtrain, test=dtest)
# to train with watchlist, use xgb.train, which contains more advanced features
# watchlist allows us to monitor the evaluation result on all data in the list
print ('train xgboost using xgb.train with watchlist')
print("Train xgboost using xgb.train with watchlist")
bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nround=2, watchlist=watchlist,
nthread = 2, objective = "binary:logistic")
# we can change evaluation metrics, or use multiple evaluation metrics
print ('train xgboost using xgb.train with watchlist, watch logloss and error')
print("train xgboost using xgb.train with watchlist, watch logloss and error")
bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nround=2, watchlist=watchlist,
eval.metric = "error", eval.metric = "logloss",
nthread = 2, objective = "binary:logistic")
@@ -102,4 +102,9 @@ xgb.dump(bst, "dump.raw.txt", with.stats = T)
# Finally, you can check which features are the most important.
print("Most important features (look at column Gain):")
print(xgb.importance(feature_names = train$data@Dimnames[[2]], filename_dump = "dump.raw.txt"))
imp_matrix <- xgb.importance(feature_names = train$data@Dimnames[[2]], model = bst)
print(imp_matrix)
# Feature importance bar plot by gain
print("Feature importance Plot : ")
print(xgb.plot.importance(importance_matrix = imp_matrix))

View File

@@ -23,4 +23,4 @@ setinfo(dtrain, "base_margin", ptrain)
setinfo(dtest, "base_margin", ptest)
print('this is result of boost from initial prediction')
bst <- xgb.train( param, dtrain, 1, watchlist )
bst <- xgb.train(params = param, data = dtrain, nrounds = 1, watchlist = watchlist)

View File
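For reference, a short sketch of the boost-from-prediction idea this demo illustrates: train briefly, feed the margin predictions back through setinfo as base_margin, and continue boosting on top of them (a sketch rather than repository code, assuming the agaricus data used throughout the demos).

```r
library(xgboost)
data(agaricus.train, package = 'xgboost')
data(agaricus.test,  package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest  <- xgb.DMatrix(agaricus.test$data,  label = agaricus.test$label)
watchlist <- list(eval = dtest, train = dtrain)
param <- list(max.depth = 2, eta = 1, silent = 1, objective = 'binary:logistic')

# Train a 1-round model and take its untransformed (margin) predictions.
bst <- xgb.train(params = param, data = dtrain, nrounds = 1, watchlist = watchlist)
ptrain <- predict(bst, dtrain, outputmargin = TRUE)
ptest  <- predict(bst, dtest,  outputmargin = TRUE)

# Use those margins as the starting point for further boosting.
setinfo(dtrain, "base_margin", ptrain)
setinfo(dtest,  "base_margin", ptest)
bst2 <- xgb.train(params = param, data = dtrain, nrounds = 1, watchlist = watchlist)
```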

@@ -0,0 +1,35 @@
# install development version of caret library that contains xgboost models
devtools::install_github("topepo/caret/pkg/caret")
require(caret)
require(xgboost)
require(data.table)
require(vcd)
require(e1071)
# Load Arthritis dataset in memory.
data(Arthritis)
# Create a copy of the dataset with the data.table package (data.table is 100% compatible with R data.frame but its syntax is more consistent and its performance is really good).
df <- data.table(Arthritis, keep.rownames = F)
# Let's add some new categorical features to see if it helps. Of course these features are highly correlated with the Age feature. Usually it's not a good thing in ML, but tree algorithms (including boosted trees) are able to select the best features, even in the case of highly correlated features.
# For the first feature we create groups of age by rounding the real age. Note that we transform it to a factor (categorical data) so the algorithm treats them as independent values.
df[,AgeDiscret:= as.factor(round(Age/10,0))]
# Here is an even stronger simplification of the real age with an arbitrary split at 30 years old. I choose this value based on nothing. We will see later if simplifying the information based on arbitrary values is a good strategy (I am sure you already have an idea of how well it will work!).
df[,AgeCat:= as.factor(ifelse(Age > 30, "Old", "Young"))]
# We remove ID as there is nothing to learn from this feature (it will just add some noise as the dataset is small).
df[,ID:=NULL]
#-------------Basic Training using XGBoost in caret Library-----------------
# Set up control parameters for caret::train
# Here we use 10-fold cross-validation, repeating twice, and using random search for tuning hyper-parameters.
fitControl <- trainControl(method = "cv", number = 10, repeats = 2, search = "random")
# train a xgbTree model using caret::train
model <- train(factor(Improved)~., data = df, method = "xgbTree", trControl = fitControl)
# Instead of tree for our boosters, you can also fit a linear regression or logistic regression model using xgbLinear
# model <- train(factor(Improved)~., data = df, method = "xgbLinear", trControl = fitControl)
# See model results
print(model)

View File

@@ -1,11 +1,13 @@
require(xgboost)
require(Matrix)
require(data.table)
if (!require(vcd)) install.packages('vcd') #Available in Cran. Used for its dataset with categorical values.
if (!require(vcd)) {
install.packages('vcd') # Available on CRAN. Used for its dataset with categorical values.
require(vcd)
}
# According to its documentation, Xgboost works only on numbers.
# Sometimes the dataset we have to work on has categorical data.
# A categorical variable is one which have a fixed number of values. By exemple, if for each observation a variable called "Colour" can have only "red", "blue" or "green" as value, it is a categorical variable.
# A categorical variable is one which has a fixed number of values. For example, if for each observation a variable called "Colour" can have only "red", "blue" or "green" as value, it is a categorical variable.
#
# In R, a categorical variable is called a factor.
# Type ?factor in console for more information.
@@ -65,18 +67,17 @@ output_vector = df[,Y:=0][Improved == "Marked",Y:=1][,Y]
cat("Learning...\n")
bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 9,
eta = 1, nthread = 2, nround = 10,objective = "binary:logistic")
xgb.dump(bst, 'xgb.model.dump', with.stats = T)
# sparse_matrix@Dimnames[[2]] represents the column names of the sparse matrix.
importance <- xgb.importance(sparse_matrix@Dimnames[[2]], 'xgb.model.dump')
importance <- xgb.importance(feature_names = sparse_matrix@Dimnames[[2]], model = bst)
print(importance)
# According to the matrix below, the most important feature in this dataset to predict if the treatment will work is the Age. The second most important feature is having received a placebo or not. The sex is third. Then we see our generated features (AgeDiscret). We can see that their contribution is very low (Gain column).
# Does these results make sense?
# Do these results make sense?
# Let's check some Chi2 between each of these features and the outcome.
print(chisq.test(df$Age, df$Y))
# Pearson correlation between Age and illness disapearing is 35
# Pearson correlation between Age and illness disappearing is 35
print(chisq.test(df$AgeDiscret, df$Y))
# Our first simplification of Age gives a Pearson correlation of 8.
@@ -84,6 +85,6 @@ print(chisq.test(df$AgeDiscret, df$Y))
print(chisq.test(df$AgeCat, df$Y))
# The perfectly random split I did between young and old at 30 years old has a low correlation of 2. It's a result we may expect, as maybe in my mind being over 30 means being old (I am 32 and starting to feel old, this may explain that), but for the illness we are studying, the age of vulnerability is not the same. Don't let your "gut" lower the quality of your model. In "data science", there is science :-)
# As you can see, in general destroying information by simplying it won't improve your model. Chi2 just demonstrates that. But in more complex cases, creating a new feature based on existing one which makes link with the outcome more obvious may help the algorithm and improve the model. The case studied here is not enough complex to show that. Check Kaggle forum for some challenging datasets.
# As you can see, in general destroying information by simplifying it won't improve your model. Chi2 just demonstrates that. But in more complex cases, creating a new feature based on an existing one which makes the link with the outcome more obvious may help the algorithm and improve the model. The case studied here is not complex enough to show that. Check the Kaggle forum for some challenging datasets.
# However it's almost always worse when you add some arbitrary rules.
# Moreover, you can notice that even though we have added some not-so-useful new features highly correlated with other features, the boosted tree algorithm has been able to choose the best one, which in this case is Age. A linear model may not be that strong in this scenario.

View File
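For reference, a minimal sketch of the one-hot encoding step this demo relies on, assuming the Arthritis dataset from the vcd package: factors are expanded into binary columns with sparse.model.matrix before being handed to xgboost. The hyper-parameters below are placeholders, not the demo's values.

```r
library(xgboost)
library(Matrix)
library(data.table)
library(vcd)

data(Arthritis)
df <- data.table(Arthritis, keep.rownames = FALSE)
df[, ID := NULL]                                    # nothing to learn from the row ID

# Label: did the treatment show a marked improvement?
output_vector <- as.numeric(df$Improved == "Marked")

# Expand every factor column into 0/1 indicator columns (one per level).
sparse_matrix <- sparse.model.matrix(Improved ~ . - 1, data = df)
dim(sparse_matrix)

bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4,
               eta = 1, nthread = 2, nrounds = 10, objective = "binary:logistic")
```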

@@ -40,12 +40,12 @@ evalerror <- function(preds, dtrain) {
return(list(metric = "error", value = err))
}
param <- list(max.depth=2,eta=1,silent=1)
param <- list(max.depth=2,eta=1,silent=1,
objective = logregobj, eval_metric = evalerror)
# train with customized objective
xgb.cv(param, dtrain, nround, nfold = 5,
obj = logregobj, feval=evalerror)
xgb.cv(params = param, data = dtrain, nrounds = nround, nfold = 5)
# do cross validation with prediction values for each fold
res <- xgb.cv(param, dtrain, nround, nfold=5, prediction = TRUE)
res <- xgb.cv(params = param, data = dtrain, nrounds = nround, nfold = 5, prediction = TRUE)
res$dt
length(res$pred)

View File

@@ -8,7 +8,6 @@ dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
# note: for customized objective function, we leave objective as default
# note: what we are getting is margin value in prediction
# you must know what you are doing
param <- list(max.depth=2,eta=1,nthread = 2, silent=1)
watchlist <- list(eval = dtest, train = dtrain)
num_round <- 2
@@ -33,10 +32,13 @@ evalerror <- function(preds, dtrain) {
err <- as.numeric(sum(labels != (preds > 0)))/length(labels)
return(list(metric = "error", value = err))
}
param <- list(max.depth=2, eta=1, nthread = 2, silent=1,
objective=logregobj, eval_metric=evalerror)
print ('start training with user customized objective')
# training with customized objective, we can also do step by step training
# simply look at xgboost.py's implementation of train
bst <- xgb.train(param, dtrain, num_round, watchlist, logregobj, evalerror)
bst <- xgb.train(param, dtrain, num_round, watchlist)
#
# there can be cases where you want additional information
@@ -55,8 +57,9 @@ logregobjattr <- function(preds, dtrain) {
hess <- preds * (1 - preds)
return(list(grad = grad, hess = hess))
}
param <- list(max.depth=2, eta=1, nthread = 2, silent=1,
objective=logregobjattr, eval_metric=evalerror)
print ('start training with user customized objective, with additional attributes in DMatrix')
# training with customized objective, we can also do step by step training
# simply look at xgboost.py's implementation of train
bst <- xgb.train(param, dtrain, num_round, watchlist, logregobjattr, evalerror)
bst <- xgb.train(param, dtrain, num_round, watchlist)

View File
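For reference, a self-contained sketch of the new calling convention shown above, where the customized objective and evaluation metric are passed inside params as objective and eval_metric rather than as the obj/feval arguments (illustrative only, using the agaricus data).

```r
library(xgboost)
data(agaricus.train, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
watchlist <- list(train = dtrain)

# Logistic-loss gradient and hessian computed from raw margin predictions.
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds <- 1 / (1 + exp(-preds))
  grad <- preds - labels
  hess <- preds * (1 - preds)
  list(grad = grad, hess = hess)
}
# Plain classification error on the margin scale (positive margin = class 1).
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- sum(labels != (preds > 0)) / length(labels)
  list(metric = "error", value = err)
}

param <- list(max.depth = 2, eta = 1, silent = 1,
              objective = logregobj, eval_metric = evalerror)
bst <- xgb.train(params = param, data = dtrain, nrounds = 2, watchlist = watchlist)
```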

@@ -31,9 +31,10 @@ evalerror <- function(preds, dtrain) {
return(list(metric = "error", value = err))
}
print ('start training with early Stopping setting')
# training with customized objective, we can also do step by step training
# simply look at xgboost.py's implementation of train
bst <- xgb.train(param, dtrain, num_round, watchlist, logregobj, evalerror, maximize = FALSE,
bst <- xgb.train(param, dtrain, num_round, watchlist,
objective = logregobj, eval_metric = evalerror, maximize = FALSE,
early.stop.round = 3)
bst <- xgb.cv(param, dtrain, num_round, nfold=5, obj=logregobj, feval = evalerror,
bst <- xgb.cv(param, dtrain, num_round, nfold = 5,
objective = logregobj, eval_metric = evalerror,
maximize = FALSE, early.stop.round = 3)

View File

@@ -1,21 +1,52 @@
require(xgboost)
require(data.table)
require(Matrix)
set.seed(1982)
# load in the agaricus dataset
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
dtrain <- xgb.DMatrix(data = agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(data = agaricus.test$data, label = agaricus.test$label)
param <- list(max.depth=2,eta=1,silent=1,objective='binary:logistic')
watchlist <- list(eval = dtest, train = dtrain)
nround = 5
param <- list(max.depth=2, eta=1, silent=1, objective='binary:logistic')
nround = 4
# training the model for two rounds
bst = xgb.train(param, dtrain, nround, nthread = 2, watchlist)
cat('start testing prediction from first n trees\n')
bst = xgb.train(params = param, data = dtrain, nrounds = nround, nthread = 2)
# Model accuracy without new features
accuracy.before <- sum((predict(bst, agaricus.test$data) >= 0.5) == agaricus.test$label) / length(agaricus.test$label)
### predict using first 2 tree
pred_with_leaf = predict(bst, dtest, ntreelimit = 2, predleaf = TRUE)
head(pred_with_leaf)
# by default, we predict using all the trees
pred_with_leaf = predict(bst, dtest, predleaf = TRUE)
head(pred_with_leaf)
create.new.tree.features <- function(model, original.features){
pred_with_leaf <- predict(model, original.features, predleaf = TRUE)
cols <- list()
for(i in 1:length(trees)){
# max is not the real max but it's not important for the purpose of adding features
leaf.id <- sort(unique(pred_with_leaf[,i]))
cols[[i]] <- factor(x = pred_with_leaf[,i], level = leaf.id)
}
cBind(original.features, sparse.model.matrix( ~ . -1, as.data.frame(cols)))
}
# Convert previous features to one hot encoding
new.features.train <- create.new.tree.features(bst, agaricus.train$data)
new.features.test <- create.new.tree.features(bst, agaricus.test$data)
# learning with new features
new.dtrain <- xgb.DMatrix(data = new.features.train, label = agaricus.train$label)
new.dtest <- xgb.DMatrix(data = new.features.test, label = agaricus.test$label)
watchlist <- list(train = new.dtrain)
bst <- xgb.train(params = param, data = new.dtrain, nrounds = nround, nthread = 2)
# Model accuracy with new features
accuracy.after <- sum((predict(bst, new.dtest) >= 0.5) == agaricus.test$label) / length(agaricus.test$label)
# Here the accuracy was already good and is now perfect.
cat(paste("The accuracy was", accuracy.before, "before adding leaf features and it is now", accuracy.after, "!\n"))

View File

@@ -9,3 +9,4 @@ demo(create_sparse_matrix)
demo(predict_leaf_indices)
demo(early_stopping)
demo(poisson_regression)
demo(caret_wrapper)

View File

@@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgboost.R
\docType{data}
\name{agaricus.test}

View File

@@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgboost.R
\docType{data}
\name{agaricus.train}

View File

@@ -0,0 +1,15 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.plot.deepness.R
\name{edge.parser}
\alias{edge.parser}
\title{Parse the graph to extract vector of edges}
\usage{
edge.parser(element)
}
\arguments{
\item{element}{igraph object containing the path from the root to the leaf.}
}
\description{
Parse the graph to extract vector of edges
}

View File

@@ -0,0 +1,15 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.plot.deepness.R
\name{get.paths.to.leaf}
\alias{get.paths.to.leaf}
\title{Extract path from root to leaf from data.table}
\usage{
get.paths.to.leaf(dt.tree)
}
\arguments{
\item{dt.tree}{data.table containing the nodes and edges of the trees}
}
\description{
Extract path from root to leaf from data.table
}

View File

@@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/getinfo.xgb.DMatrix.R
\docType{methods}
\name{getinfo}

View File

@@ -0,0 +1,15 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.plot.deepness.R
\name{multiplot}
\alias{multiplot}
\title{Plot multiple graphs at the same time}
\usage{
multiplot(..., cols = 1)
}
\arguments{
\item{cols}{number of columns}
}
\description{
Plot multiple graph aligned by rows and columns.
}

View File

@@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/nrow.xgb.DMatrix.R
\docType{methods}
\name{nrow,xgb.DMatrix-method}
@@ -18,5 +18,6 @@ data(agaricus.train, package='xgboost')
train <- agaricus.train
dtrain <- xgb.DMatrix(train$data, label=train$label)
stopifnot(nrow(dtrain) == nrow(train$data))
}

View File

@@ -1,11 +1,11 @@
% Generated by roxygen2 (4.1.1): do not edit by hand
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/predict.xgb.Booster.R
\docType{methods}
\name{predict,xgb.Booster-method}
\alias{predict,xgb.Booster-method}
\title{Predict method for eXtreme Gradient Boosting model}
\usage{
\S4method{predict}{xgb.Booster}(object, newdata, missing = NULL,
\S4method{predict}{xgb.Booster}(object, newdata, missing = NA,
outputmargin = FALSE, ntreelimit = NULL, predleaf = FALSE)
}
\arguments{
@@ -31,6 +31,16 @@ than 0. It will use all trees by default.}
\description{
Predicted values based on xgboost model object.
}
\details{
The option \code{ntreelimit} purpose is to let the user train a model with lots
of trees but use only the first trees for prediction to avoid overfitting
(without having to train a new model with less trees).
The option \code{predleaf} purpose is inspired from §3.1 of the paper
\code{Practical Lessons from Predicting Clicks on Ads at Facebook}.
The idea is to use the model as a generator of new features which capture non linear link
from original features.
}
\examples{
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')

View File
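For reference, a short sketch of the two options documented above, ntreelimit and predleaf (a sketch only, assuming the agaricus data and a small model).

```r
library(xgboost)
data(agaricus.train, package = 'xgboost')
data(agaricus.test,  package = 'xgboost')
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               max.depth = 2, eta = 1, nrounds = 4, nthread = 2,
               objective = "binary:logistic")
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)

# Predict with only the first 2 trees (useful to probe over-fitting).
pred2 <- predict(bst, dtest, ntreelimit = 2)

# Return the index of the leaf each observation falls into, one column per tree.
leaf_idx <- predict(bst, dtest, predleaf = TRUE)
head(leaf_idx)
```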

@@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/predict.xgb.Booster.handle.R
\docType{methods}
\name{predict,xgb.Booster.handle-method}

View File

@@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/setinfo.xgb.DMatrix.R
\docType{methods}
\name{setinfo}

View File

@@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/slice.xgb.DMatrix.R
\docType{methods}
\name{slice}

View File

@@ -1,10 +1,10 @@
% Generated by roxygen2 (4.1.1): do not edit by hand
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.DMatrix.R
\name{xgb.DMatrix}
\alias{xgb.DMatrix}
\title{Construct xgb.DMatrix object}
\usage{
xgb.DMatrix(data, info = list(), missing = 0, ...)
xgb.DMatrix(data, info = list(), missing = NA, ...)
}
\arguments{
\item{data}{a \code{matrix} object, a \code{dgCMatrix} object or a character

View File
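For reference, a tiny sketch of the new default missing = NA when building a DMatrix from a dense matrix; the toy matrix below is made up for illustration.

```r
library(xgboost)
x <- matrix(c(1, NA, 3,
              0,  2, NA), nrow = 2, byrow = TRUE)
y <- c(0, 1)
dtrain  <- xgb.DMatrix(data = x, label = y)               # NA cells treated as missing
dtrain0 <- xgb.DMatrix(data = x, label = y, missing = 0)  # treat 0 as missing instead
```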

@@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.DMatrix.save.R
\name{xgb.DMatrix.save}
\alias{xgb.DMatrix.save}

View File

@@ -0,0 +1,88 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.create.features.R
\name{xgb.create.features}
\alias{xgb.create.features}
\title{Create new features from a previously learned model}
\usage{
xgb.create.features(model, training.data)
}
\arguments{
\item{model}{decision tree boosting model learned on the original data}
\item{training.data}{original data (usually provided as a \code{dgCMatrix} matrix)}
}
\value{
\code{dgCMatrix} matrix including both the original data and the new features.
}
\description{
May improve the learning by adding new features to the training data based on the decision trees from a previously learned model.
}
\details{
This is the function inspired from the paragraph 3.1 of the paper:
\strong{Practical Lessons from Predicting Clicks on Ads at Facebook}
\emph{(Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yan, xin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers,
Joaquin Quiñonero Candela)}
International Workshop on Data Mining for Online Advertising (ADKDD) - August 24, 2014
\url{https://research.facebook.com/publications/758569837499391/practical-lessons-from-predicting-clicks-on-ads-at-facebook/}.
Extract explaining the method:
"\emph{We found that boosted decision trees are a powerful and very
convenient way to implement non-linear and tuple transformations
of the kind we just described. We treat each individual
tree as a categorical feature that takes as value the
index of the leaf an instance ends up falling in. We use
1-of-K coding of this type of features.
For example, consider the boosted tree model in Figure 1 with 2 subtrees,
where the first subtree has 3 leafs and the second 2 leafs. If an
instance ends up in leaf 2 in the first subtree and leaf 1 in
second subtree, the overall input to the linear classifier will
be the binary vector \code{[0, 1, 0, 1, 0]}, where the first 3 entries
correspond to the leaves of the first subtree and last 2 to
those of the second subtree.
[...]
We can understand boosted decision tree
based transformation as a supervised feature encoding that
converts a real-valued vector into a compact binary-valued
vector. A traversal from root node to a leaf node represents
a rule on certain features.}"
}
\examples{
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
dtrain <- xgb.DMatrix(data = agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(data = agaricus.test$data, label = agaricus.test$label)
param <- list(max.depth=2, eta=1, silent=1, objective='binary:logistic')
nround = 4
bst = xgb.train(params = param, data = dtrain, nrounds = nround, nthread = 2)
# Model accuracy without new features
accuracy.before <- sum((predict(bst, agaricus.test$data) >= 0.5) == agaricus.test$label) / length(agaricus.test$label)
# Convert previous features to one hot encoding
new.features.train <- xgb.create.features(model = bst, agaricus.train$data)
new.features.test <- xgb.create.features(model = bst, agaricus.test$data)
# learning with new features
new.dtrain <- xgb.DMatrix(data = new.features.train, label = agaricus.train$label)
new.dtest <- xgb.DMatrix(data = new.features.test, label = agaricus.test$label)
watchlist <- list(train = new.dtrain)
bst <- xgb.train(params = param, data = new.dtrain, nrounds = nround, nthread = 2)
# Model accuracy with new features
accuracy.after <- sum((predict(bst, new.dtest) >= 0.5) == agaricus.test$label) / length(agaricus.test$label)
# Here the accuracy was already good and is now perfect.
cat(paste("The accuracy was", accuracy.before, "before adding leaf features and it is now", accuracy.after, "!\\n"))
}

View File

@@ -1,14 +1,13 @@
% Generated by roxygen2 (4.1.1): do not edit by hand
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.cv.R
\name{xgb.cv}
\alias{xgb.cv}
\title{Cross Validation}
\usage{
xgb.cv(params = list(), data, nrounds, nfold, label = NULL,
missing = NULL, prediction = FALSE, showsd = TRUE, metrics = list(),
obj = NULL, feval = NULL, stratified = TRUE, folds = NULL,
verbose = T, early_stop_round = NULL, early.stop.round = NULL,
maximize = NULL, ...)
xgb.cv(params = list(), data, nrounds, nfold, label = NULL, missing = NA,
prediction = FALSE, showsd = TRUE, metrics = list(), obj = NULL,
feval = NULL, stratified = TRUE, folds = NULL, verbose = T,
print.every.n = 1L, early.stop.round = NULL, maximize = NULL, ...)
}
\arguments{
\item{params}{the list of parameters. Commonly used ones are:
@@ -41,7 +40,7 @@ value that represents missing value. Sometime a data use 0 or other extreme valu
\item{showsd}{\code{boolean}, whether show standard deviation of cross validation}
\item{metrics,}{list of evaluation metrics to be used in corss validation,
\item{metrics, }{list of evaluation metrics to be used in cross validation,
when it is not specified, the evaluation metric is chosen according to objective function.
Possible options are:
\itemize{
@@ -66,14 +65,14 @@ If folds are supplied, the nfold and stratified parameters would be ignored.}
\item{verbose}{\code{boolean}, print the statistics during the process}
\item{early_stop_round}{If \code{NULL}, the early stopping function is not triggered.
\item{print.every.n}{Print every N progress messages when \code{verbose>0}. Default is 1 which means all messages are printed.}
\item{early.stop.round}{If \code{NULL}, the early stopping function is not triggered.
If set to an integer \code{k}, training with a validation set will stop if the performance
keeps getting worse consecutively for \code{k} rounds.}
\item{early.stop.round}{An alternative of \code{early_stop_round}.}
\item{maximize}{If \code{feval} and \code{early_stop_round} are set, then \code{maximize} must be set as well.
\code{maximize=TRUE} means the larger the evaluation score the better.}
\item{maximize}{If \code{feval} and \code{early.stop.round} are set, then \code{maximize} must be set as well.
\code{maximize=TRUE} means the larger the evaluation score the better.}
\item{...}{other parameters to pass to \code{params}.}
}

View File
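For reference, a minimal sketch of a cross-validation call with the renamed arguments documented above (print.every.n, early.stop.round); it is illustrative only and uses the agaricus data.

```r
library(xgboost)
data(agaricus.train, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
param <- list(max.depth = 2, eta = 1, silent = 1, objective = "binary:logistic")

history <- xgb.cv(params = param, data = dtrain, nrounds = 10, nfold = 5,
                  metrics = list("error"),   # evaluation metric(s) to report
                  print.every.n = 2,         # print every 2nd progress message
                  early.stop.round = 3,      # stop if no improvement for 3 rounds
                  maximize = FALSE)
print(history)                               # per-round evaluation history
```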

@@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.dump.R
\name{xgb.dump}
\alias{xgb.dump}
@@ -19,9 +19,9 @@ See demo/ for walkthrough example in R, and
for example Format.}
\item{with.stats}{whether dump statistics of splits
When this option is on, the model dump comes with two additional statistics:
gain is the approximate loss function gain we get in each split;
cover is the sum of second order gradient in each node.}
When this option is on, the model dump comes with two additional statistics:
gain is the approximate loss function gain we get in each split;
cover is the sum of second order gradient in each node.}
}
\value{
if fname is not provided or set to \code{NULL} the function will return the model as a \code{character} vector. Otherwise it will return \code{TRUE}.

View File
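For reference, a quick sketch of dumping a model with split statistics as described above; with.stats = TRUE adds the gain of each split and the cover of each node (a sketch only, using the agaricus data).

```r
library(xgboost)
data(agaricus.train, package = 'xgboost')
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               max.depth = 2, eta = 1, nrounds = 2, nthread = 2,
               objective = "binary:logistic")

dump_text <- xgb.dump(bst, with.stats = TRUE)   # returned as a character vector
head(dump_text)
# xgb.dump(bst, fname = "dump.raw.txt", with.stats = TRUE)   # or write to a file
```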

@@ -1,18 +1,16 @@
% Generated by roxygen2 (4.1.1): do not edit by hand
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.importance.R
\name{xgb.importance}
\alias{xgb.importance}
\title{Show importance of features in a model}
\usage{
xgb.importance(feature_names = NULL, filename_dump = NULL, model = NULL,
data = NULL, label = NULL, target = function(x) ((x + label) == 2))
xgb.importance(feature_names = NULL, model = NULL, data = NULL,
label = NULL, target = function(x) ((x + label) == 2))
}
\arguments{
\item{feature_names}{names of each feature as a character vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.}
\item{feature_names}{names of each feature as a \code{character} vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.}
\item{filename_dump}{the path to the text file storing the model. Model dump must include the gain per feature and per tree (\code{with.stats = T} in function \code{xgb.dump}).}
\item{model}{generated by the \code{xgb.train} function. Avoid the creation of a dump file.}
\item{model}{generated by the \code{xgb.train} function.}
\item{data}{the dataset used for the training step. Will be used with \code{label} parameter for co-occurence computation. More information in \code{Detail} part. This parameter is optional.}
@@ -24,23 +22,24 @@ xgb.importance(feature_names = NULL, filename_dump = NULL, model = NULL,
A \code{data.table} of the features used in the model with their average gain (and their weight for boosted tree models).
}
\description{
Read a xgboost model text dump.
Can be tree or linear model (text dump of linear model are only supported in dev version of \code{Xgboost} for now).
Create a \code{data.table} of the most important features of a model.
}
\details{
This is the function to understand the model trained (and through your model, your data).
Results are returned for both linear and tree models.
This function is for both linear and tree models.
\code{data.table} is returned by the function.
There are 3 columns :
The columns are :
\itemize{
\item \code{Features} name of the features as provided in \code{feature_names} or already present in the model dump.
\item \code{Gain} contribution of each feature to the model. For boosted tree model, each gain of each feature of each tree is taken into account, then average per feature to give a vision of the entire model. Highest percentage means important feature to predict the \code{label} used for the training ;
\item \code{Cover} metric of the number of observation related to this feature (only available for tree models) ;
\item \code{Weight} percentage representing the relative number of times a feature have been taken into trees. \code{Gain} should be prefered to search the most important feature. For boosted linear model, this column has no meaning.
\item \code{Features} name of the features as provided in \code{feature_names} or already present in the model dump;
\item \code{Gain} contribution of each feature to the model. For boosted tree models, each gain of each feature of each tree is taken into account, then averaged per feature to give a vision of the entire model. Highest percentage means important feature to predict the \code{label} used for the training (only available for tree models);
\item \code{Cover} metric of the number of observation related to this feature (only available for tree models);
\item \code{Weight} percentage representing the relative number of times a feature has been taken into trees.
}
If you don't provide \code{feature_names}, index of the features will be used instead.
Because the index is extracted from the model dump (made on the C++ side), it starts at 0 (usual in C++) instead of 1 (usual in R).
Co-occurence count
------------------
@@ -53,18 +52,14 @@ If you need to remember one thing only: until you want to leave us early, don't
\examples{
data(agaricus.train, package='xgboost')
# Both dataset are list with two items, a sparse matrix and labels
# (labels = outcome column which will be learned).
# Each column of the sparse Matrix is a feature in one hot encoding format.
train <- agaricus.train
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 2,
eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
# train$data@Dimnames[[2]] represents the column names of the sparse matrix.
xgb.importance(train$data@Dimnames[[2]], model = bst)
# agaricus.train$data@Dimnames[[2]] represents the column names of the sparse matrix.
xgb.importance(agaricus.train$data@Dimnames[[2]], model = bst)
# Same thing with co-occurence computation this time
xgb.importance(train$data@Dimnames[[2]], model = bst, data = train$data, label = train$label)
xgb.importance(agaricus.train$data@Dimnames[[2]], model = bst, data = agaricus.train$data, label = agaricus.train$label)
}

View File

@@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.load.R
\name{xgb.load}
\alias{xgb.load}

View File

@@ -1,33 +1,33 @@
% Generated by roxygen2 (4.1.1): do not edit by hand
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.model.dt.tree.R
\name{xgb.model.dt.tree}
\alias{xgb.model.dt.tree}
\title{Convert tree model dump to data.table}
\title{Parse boosted tree model text dump}
\usage{
xgb.model.dt.tree(feature_names = NULL, filename_dump = NULL,
model = NULL, text = NULL, n_first_tree = NULL)
xgb.model.dt.tree(feature_names = NULL, model = NULL, text = NULL,
n_first_tree = NULL)
}
\arguments{
\item{feature_names}{names of each feature as a character vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.}
\item{feature_names}{names of each feature as a character vector. Can be extracted from a sparse matrix (see example). If the model already contains feature names, this argument should be \code{NULL} (default value).}
\item{filename_dump}{the path to the text file storing the model. Model dump must include the gain per feature and per tree (parameter \code{with.stats = T} in function \code{xgb.dump}).}
\item{model}{object created by the \code{xgb.train} function.}
\item{model}{dump generated by the \code{xgb.train} function. Avoid the creation of a dump file.}
\item{text}{\code{character} vector generated by the \code{xgb.dump} function. Model dump must include the gain per feature and per tree (parameter \code{with.stats = TRUE} in function \code{xgb.dump}).}
\item{text}{dump generated by the \code{xgb.dump} function. Avoid the creation of a dump file. Model dump must include the gain per feature and per tree (parameter \code{with.stats = T} in function \code{xgb.dump}).}
\item{n_first_tree}{limit the plot to the n first trees. If \code{NULL}, all trees of the model are plotted. Performance can be low for huge models.}
\item{n_first_tree}{limit the plot to the \code{n} first trees. If set to \code{NULL}, all trees of the model are plotted. Performance can be low depending on the size of the model.}
}
\value{
A \code{data.table} of the features used in the model with their gain, cover and few other thing.
A \code{data.table} of the features used in the model with their gain, cover and a few other pieces of information.
}
\description{
Read a tree model text dump and return a data.table.
Parse a boosted tree model text dump and return a \code{data.table}.
}
\details{
General function to convert a text dump of tree model to a Matrix. The purpose is to help user to explore the model and get a better understanding of it.
General function to convert a text dump of tree model to a \code{data.table}.
The content of the \code{data.table} is organised that way:
The purpose is to help user to explore the model and get a better understanding of it.
The columns of the \code{data.table} are:
\itemize{
\item \code{ID}: unique identifier of a node ;
@@ -39,21 +39,17 @@ The content of the \code{data.table} is organised that way:
\item \code{Quality}: it's the gain related to the split in this specific node ;
\item \code{Cover}: metric to measure the number of observation affected by the split ;
\item \code{Tree}: ID of the tree. It is included in the main ID ;
\item \code{Yes.X} or \code{No.X}: data related to the pointer in \code{Yes} or \code{No} column ;
\item \code{Yes.Feature}, \code{No.Feature}, \code{Yes.Cover}, \code{No.Cover}, \code{Yes.Quality} and \code{No.Quality}: data related to the pointer in \code{Yes} or \code{No} column ;
}
}
\examples{
data(agaricus.train, package='xgboost')
#Both dataset are list with two items, a sparse matrix and labels
#(labels = outcome column which will be learned).
#Each column of the sparse Matrix is a feature in one hot encoding format.
train <- agaricus.train
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 2,
eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
#agaricus.test$data@Dimnames[[2]] represents the column names of the sparse matrix.
xgb.model.dt.tree(agaricus.train$data@Dimnames[[2]], model = bst)
# agaricus.train$data@Dimnames[[2]] represents the column names of the sparse matrix.
xgb.model.dt.tree(feature_names = agaricus.train$data@Dimnames[[2]], model = bst)
}

View File

@@ -0,0 +1,46 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.plot.deepness.R
\name{xgb.plot.deepness}
\alias{xgb.plot.deepness}
\title{Plot model trees deepness}
\usage{
xgb.plot.deepness(model = NULL)
}
\arguments{
\item{model}{dump generated by the \code{xgb.train} function.}
}
\value{
Two graphs showing the distribution of the model deepness.
}
\description{
Generate a graph to plot the distribution of deepness among trees.
}
\details{
Display both the number of \code{leaf} and the distribution of \code{weighted observations}
by tree deepness level.
The purpose of this function is to help the user find suitable values for
the \code{max.depth} and \code{min_child_weight} parameters according to the bias / variance trade-off.
See \link{xgb.train} for more information about these parameters.
The graph is made of two parts:
\itemize{
\item Count: number of leaf per level of deepness;
\item Weighted cover: normalized weighted cover per leaf (weighted number of instances).
}
This function is inspired by the blog post \url{http://aysent.github.io/2015/11/08/random-forest-leaf-visualization.html}
}
\examples{
data(agaricus.train, package='xgboost')
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 15,
eta = 1, nthread = 2, nround = 30, objective = "binary:logistic",
min_child_weight = 50)
xgb.plot.deepness(model = bst)
}

View File

@@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.plot.importance.R
\name{xgb.plot.importance}
\alias{xgb.plot.importance}
@@ -15,11 +15,11 @@ xgb.plot.importance(importance_matrix = NULL, numberOfClusters = c(1:10))
A \code{ggplot2} bar graph representing each feature by a horizontal bar. The longer the bar, the more important the feature. Features are sorted and clustered by importance; the group is represented through the color of the bar.
}
\description{
Read a data.table containing feature importance details and plot it.
Read a data.table containing feature importance details and plot it (for both GLM and Trees).
}
\details{
The purpose of this function is to easily represent the importance of each feature of a model.
The function return a ggplot graph, therefore each of its characteristic can be overriden (to customize it).
The function returns a ggplot graph, therefore each of its characteristics can be overridden (to customize it).
In particular you may want to override the title of the graph. To do so, add \code{+ ggtitle("A GRAPH NAME")} next to the value returned by this function.
}
\examples{
@@ -28,13 +28,13 @@ data(agaricus.train, package='xgboost')
#Both dataset are list with two items, a sparse matrix and labels
#(labels = outcome column which will be learned).
#Each column of the sparse Matrix is a feature in one hot encoding format.
train <- agaricus.train
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 2,
eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
#train$data@Dimnames[[2]] represents the column names of the sparse matrix.
importance_matrix <- xgb.importance(train$data@Dimnames[[2]], model = bst)
#agaricus.train$data@Dimnames[[2]] represents the column names of the sparse matrix.
importance_matrix <- xgb.importance(agaricus.train$data@Dimnames[[2]], model = bst)
xgb.plot.importance(importance_matrix)
}

View File

@@ -0,0 +1,58 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.plot.multi.trees.R
\name{xgb.plot.multi.trees}
\alias{xgb.plot.multi.trees}
\title{Project all trees on one tree and plot it}
\usage{
xgb.plot.multi.trees(model, feature_names = NULL, features.keep = 5,
plot.width = NULL, plot.height = NULL)
}
\arguments{
\item{model}{dump generated by the \code{xgb.train} function.}
\item{feature_names}{names of each feature as a \code{character} vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.}
\item{features.keep}{number of features to keep in each position of the multi trees.}
\item{plot.width}{width in pixels of the graph to produce}
\item{plot.height}{height in pixels of the graph to produce}
}
\value{
Two graphs showing the distribution of the model deepness.
}
\description{
Visualization of the ensemble of trees as a single collective unit.
}
\details{
This function tries to capture the complexity of gradient boosted tree ensemble
in a cohesive way.
The goal is to improve the interpretability of a model generally seen as a black box.
The function is dedicated to boosting applied to decision trees only.
The purpose is to move from an ensemble of trees to a single tree only.
It takes advantage of the fact that the shape of a binary tree is only defined by
its deepness (therefore in a boosting model, all trees have the same shape).
Moreover, the trees tend to reuse the same features.
The function will project each tree on one, and keep for each position the
\code{features.keep} first features (based on Gain per feature measure).
This function is inspired by this blog post:
\url{https://wellecks.wordpress.com/2015/02/21/peering-into-the-black-box-visualizing-lambdamart/}
}
\examples{
data(agaricus.train, package='xgboost')
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 15,
eta = 1, nthread = 2, nround = 30, objective = "binary:logistic",
min_child_weight = 50)
p <- xgb.plot.multi.trees(model = bst, feature_names = agaricus.train$data@Dimnames[[2]], features.keep = 3)
print(p)
}

View File

@@ -1,58 +1,48 @@
% Generated by roxygen2 (4.1.1): do not edit by hand
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.plot.tree.R
\name{xgb.plot.tree}
\alias{xgb.plot.tree}
\title{Plot a boosted tree model}
\usage{
xgb.plot.tree(feature_names = NULL, filename_dump = NULL, model = NULL,
n_first_tree = NULL, CSSstyle = NULL, width = NULL, height = NULL)
xgb.plot.tree(feature_names = NULL, model = NULL, n_first_tree = NULL,
plot.width = NULL, plot.height = NULL)
}
\arguments{
\item{feature_names}{names of each feature as a character vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.}
\item{filename_dump}{the path to the text file storing the model. Model dump must include the gain per feature and per tree (parameter \code{with.stats = T} in function \code{xgb.dump}). Possible to provide a model directly (see \code{model} argument).}
\item{feature_names}{names of each feature as a \code{character} vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.}
\item{model}{generated by the \code{xgb.train} function. Avoid the creation of a dump file.}
\item{n_first_tree}{limit the plot to the n first trees. If \code{NULL}, all trees of the model are plotted. Performance can be low for huge models.}
\item{CSSstyle}{a \code{character} vector storing a css style to customize the appearance of nodes. Look at the \href{https://github.com/knsv/mermaid/wiki}{Mermaid wiki} for more information.}
\item{plot.width}{the width of the diagram in pixels.}
\item{width}{the width of the diagram in pixels.}
\item{height}{the height of the diagram in pixels.}
\item{plot.height}{the height of the diagram in pixels.}
}
\value{
A \code{DiagrammeR} of the model.
}
\description{
Read a tree model text dump.
Plotting only works for boosted tree model (not linear model).
Read a tree model text dump and plot the model.
}
\details{
The content of each node is organised that way:
\itemize{
\item \code{feature} value ;
\item \code{cover}: the sum of second order gradient of training data classified to the leaf, if it is square loss, this simply corresponds to the number of instances in that branch. Deeper in the tree a node is, lower this metric will be ;
\item \code{feature} value;
\item \code{cover}: the sum of the second order gradient of the training data classified to the leaf. If it is square loss, this simply corresponds to the number of instances in that branch. The deeper in the tree a node is, the lower this metric will be;
\item \code{gain}: metric measuring the importance of the node in the model.
}
Each branch finishes with a leaf. For each leaf, only the \code{cover} is indicated.
It uses \href{https://github.com/knsv/mermaid/}{Mermaid} library for that purpose.
The function uses \href{http://www.graphviz.org/}{GraphViz} library for that purpose.
}
\examples{
data(agaricus.train, package='xgboost')
#Both dataset are list with two items, a sparse matrix and labels
#(labels = outcome column which will be learned).
#Each column of the sparse Matrix is a feature in one hot encoding format.
train <- agaricus.train
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max.depth = 2,
eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
#agaricus.test$data@Dimnames[[2]] represents the column names of the sparse matrix.
xgb.plot.tree(agaricus.train$data@Dimnames[[2]], model = bst)
# agaricus.train$data@Dimnames[[2]] represents the column names of the sparse matrix.
xgb.plot.tree(feature_names = agaricus.train$data@Dimnames[[2]], model = bst)
}

View File

@@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.save.R
\name{xgb.save}
\alias{xgb.save}

View File

@@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.1): do not edit by hand
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.save.raw.R
\name{xgb.save.raw}
\alias{xgb.save.raw}

View File

@@ -1,12 +1,13 @@
% Generated by roxygen2 (4.1.1): do not edit by hand
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.train.R
\name{xgb.train}
\alias{xgb.train}
\title{eXtreme Gradient Boosting Training}
\usage{
xgb.train(params = list(), data, nrounds, watchlist = list(), obj = NULL,
feval = NULL, verbose = 1, printEveryN=1L, early_stop_round = NULL,
early.stop.round = NULL, maximize = NULL, ...)
feval = NULL, verbose = 1, print.every.n = 1L,
early.stop.round = NULL, maximize = NULL, save_period = 0,
save_name = "xgboost.model", ...)
}
\arguments{
\item{params}{the list of parameters.
@@ -26,7 +27,7 @@ xgb.train(params = list(), data, nrounds, watchlist = list(), obj = NULL,
\item \code{eta} control the learning rate: scale the contribution of each tree by a factor of \code{0 < eta < 1} when it is added to the current approximation. Used to prevent overfitting by making the boosting process more conservative. Lower value for \code{eta} implies larger value for \code{nrounds}: low \code{eta} value means model more robust to overfitting but slower to compute. Default: 0.3
\item \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree. the larger, the more conservative the algorithm will be.
\item \code{max_depth} maximum depth of a tree. Default: 6
\item \code{min_child_weight} minimum sum of instance weight(hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
\item \code{min_child_weight} minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
\item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. It makes computation shorter (because less data to analyse). It is advised to use this parameter with \code{eta} and increase \code{nround}. Default: 1
\item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
\item \code{num_parallel_tree} Experimental parameter. number of trees to grow per round. Useful to test Random Forest through Xgboost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{round = 1}) accordingly. Default: 1
@@ -43,19 +44,19 @@ xgb.train(params = list(), data, nrounds, watchlist = list(), obj = NULL,
3. Task Parameters
\itemize{
\item \code{objective} specify the learning task and the corresponding learning objective, and the objective options are below:
\item \code{objective} specifies the learning task and the corresponding learning objective; users can pass a self-defined function to it. The default objective options are below:
\itemize{
\item \code{reg:linear} linear regression (Default).
\item \code{reg:logistic} logistic regression.
\item \code{binary:logistic} logistic regression for binary classification. Output probability.
\item \code{binary:logitraw} logistic regression for binary classification, output score before logistic transformation.
\item \code{num_class} set the number of classes. To use only with multiclass objectives.
\item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective. Class is represented by a number and should be from 0 to \code{tonum_class}.
\item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective. Class is represented by a number and should be from 0 to \code{num_class}.
\item \code{multi:softprob} same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata, nclass matrix. The result contains predicted probabilities of each data point belonging to each class.
\item \code{rank:pairwise} set xgboost to do ranking task by minimizing the pairwise loss.
}
\item \code{base_score} the initial prediction score of all instances, global bias. Default: 0.5
\item \code{eval_metric} evaluation metrics for validation data. Default: metric will be assigned according to objective(rmse for regression, and error for classification, mean average precision for ranking). List is provided in detail section.
\item \code{eval_metric} evaluation metrics for validation data. Users can pass a self-defined function to it. Default: the metric will be assigned according to the objective (rmse for regression, error for classification, mean average precision for ranking). The list is provided in the details section.
}}
\item{data}{takes an \code{xgb.DMatrix} as the input.}
@@ -63,10 +64,10 @@ xgb.train(params = list(), data, nrounds, watchlist = list(), obj = NULL,
\item{nrounds}{the max number of iterations}
\item{watchlist}{what information should be printed when \code{verbose=1} or
\code{verbose=2}. Watchlist is used to specify validation set monitoring
during training. For example user can specify
watchlist=list(validation1=mat1, validation2=mat2) to watch
the performance of each round's model on mat1 and mat2}
\code{verbose=2}. Watchlist is used to specify validation set monitoring
during training. For example user can specify
watchlist=list(validation1=mat1, validation2=mat2) to watch
the performance of each round's model on mat1 and mat2}
\item{obj}{customized objective function. Returns gradient and second order
gradient with given prediction and dtrain,}
@@ -78,17 +79,19 @@ prediction and dtrain,}
\item{verbose}{If 0, xgboost will stay silent. If 1, xgboost will print
information of performance. If 2, xgboost will print information of both}
\item{printEveryN}{Print every N progress messages when \code{verbose>0}. Default is 1 which means all messages are printed.}
\item{print.every.n}{Print every N progress messages when \code{verbose>0}. Default is 1 which means all messages are printed.}
\item{early_stop_round}{If \code{NULL}, the early stopping function is not triggered.
\item{early.stop.round}{If \code{NULL}, the early stopping function is not triggered.
If set to an integer \code{k}, training with a validation set will stop if the performance
keeps getting worse consecutively for \code{k} rounds.}
\item{early.stop.round}{An alternative of \code{early_stop_round}.}
\item{maximize}{If \code{feval} and \code{early_stop_round} are set, then \code{maximize} must be set as well.
\item{maximize}{If \code{feval} and \code{early.stop.round} are set, then \code{maximize} must be set as well.
\code{maximize=TRUE} means the larger the evaluation score the better.}
\item{save_period}{save the model to the disk in every \code{save_period} rounds, 0 means no such action.}
\item{save_name}{the name or path for periodically saved model file.}
\item{...}{other parameters to pass to \code{params}.}
}
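A minimal sketch of how the early stopping, printing and periodic-saving arguments above might be combined; \code{dtrain} and \code{dtest} are assumed to be existing \code{xgb.DMatrix} objects and all values are illustrative:

```
# A hedged sketch: early stopping, reduced printing and periodic saving
param <- list(objective = "binary:logistic", max.depth = 2, eta = 0.3)
watchlist <- list(eval = dtest, train = dtrain)
bst <- xgb.train(params = param, data = dtrain, nrounds = 100,
                 watchlist = watchlist,
                 print.every.n = 5,          # print every 5th round
                 early.stop.round = 3,       # stop after 3 non-improving rounds
                 maximize = FALSE,           # the error metric should decrease
                 save_period = 10,           # write the model every 10 rounds
                 save_name = "xgboost.model")
```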
\description{
@@ -107,6 +110,7 @@ Number of threads can also be manually specified via \code{nthread} parameter.
\itemize{
\item \code{rmse} root mean square error. \url{http://en.wikipedia.org/wiki/Root_mean_square_error}
\item \code{logloss} negative log-likelihood. \url{http://en.wikipedia.org/wiki/Log-likelihood}
\item \code{mlogloss} multiclass logloss. \url{https://www.kaggle.com/wiki/MultiClassLogLoss}
\item \code{error} Binary classification error rate. It is calculated as \code{(wrong cases) / (all cases)}. For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
\item \code{merror} Multiclass classification error rate. It is calculated as \code{(wrong cases) / (all cases)}.
\item \code{auc} Area under the curve. \url{http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_curve} for ranking evaluation.
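For illustration only, one of the metrics above could be requested on a validation set as follows (again assuming existing \code{dtrain}/\code{dtest} objects):

```
# Illustrative only: monitor AUC on a held-out set during training
param <- list(objective = "binary:logistic", eval_metric = "auc")
bst <- xgb.train(param, dtrain, nrounds = 2,
                 watchlist = list(eval = dtest, train = dtrain))
```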
@@ -122,7 +126,6 @@ data(agaricus.train, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- dtrain
watchlist <- list(eval = dtest, train = dtrain)
param <- list(max.depth = 2, eta = 1, silent = 1)
logregobj <- function(preds, dtrain) {
labels <- getinfo(dtrain, "label")
preds <- 1/(1 + exp(-preds))
@@ -135,6 +138,7 @@ evalerror <- function(preds, dtrain) {
err <- as.numeric(sum(labels != (preds > 0)))/length(labels)
return(list(metric = "error", value = err))
}
bst <- xgb.train(param, dtrain, nthread = 2, nround = 2, watchlist, logregobj, evalerror)
param <- list(max.depth = 2, eta = 1, silent = 1, objective=logregobj,eval_metric=evalerror)
bst <- xgb.train(param, dtrain, nthread = 2, nround = 2, watchlist)
}

View File

@@ -1,12 +1,13 @@
% Generated by roxygen2 (4.1.1): do not edit by hand
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgboost.R
\name{xgboost}
\alias{xgboost}
\title{eXtreme Gradient Boosting (Tree) library}
\usage{
xgboost(data = NULL, label = NULL, missing = NULL, params = list(),
nrounds, verbose = 1, printEveryN=1L, early_stop_round = NULL, early.stop.round = NULL,
maximize = NULL, ...)
xgboost(data = NULL, label = NULL, missing = NA, weight = NULL,
params = list(), nrounds, verbose = 1, print.every.n = 1L,
early.stop.round = NULL, maximize = NULL, save_period = 0,
save_name = "xgboost.model", ...)
}
\arguments{
\item{data}{takes \code{matrix}, \code{dgCMatrix}, local data file or
@@ -18,6 +19,8 @@ if data is local data file or \code{xgb.DMatrix}.}
\item{missing}{Missing is only used when the input is a dense matrix; pick a float
value that represents missing values. Sometimes a dataset uses 0 or another extreme value to represent missing values.}
\item{weight}{a vector indicating the weight for each row of the input.}
\item{params}{the list of parameters.
Commonly used ones are:
@@ -42,17 +45,19 @@ Commonly used ones are:
information of performance. If 2, xgboost will print information of both
performance and construction progress information}
\item{printEveryN}{Print every N progress messages when \code{verbose>0}. Default is 1 which means all messages are printed.}
\item{print.every.n}{Print every N progress messages when \code{verbose>0}. Default is 1 which means all messages are printed.}
\item{early_stop_round}{If \code{NULL}, the early stopping function is not triggered.
\item{early.stop.round}{If \code{NULL}, the early stopping function is not triggered.
If set to an integer \code{k}, training with a validation set will stop if the performance
keeps getting worse consecutively for \code{k} rounds.}
\item{early.stop.round}{An alternative of \code{early_stop_round}.}
\item{maximize}{If \code{feval} and \code{early_stop_round} are set, then \code{maximize} must be set as well.
\item{maximize}{If \code{feval} and \code{early.stop.round} are set, then \code{maximize} must be set as well.
\code{maximize=TRUE} means the larger the evaluation score the better.}
\item{save_period}{save the model to the disk in every \code{save_period} rounds, 0 means no such action.}
\item{save_name}{the name or path for periodically saved model file.}
\item{...}{other parameters to pass to \code{params}.}
}
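A hedged sketch of the \code{missing} and \code{weight} arguments described above; \code{X} (a numeric matrix in which -999 marks missing entries) and \code{y} are hypothetical:

```
# A hedged sketch: dense input with a sentinel for missing values and per-row weights
# X is an assumed numeric matrix where -999 encodes a missing entry; y is the label vector
bst <- xgboost(data = X, label = y, missing = -999,
               weight = rep(1, nrow(X)),   # uniform weights, purely illustrative
               max.depth = 2, eta = 1, nrounds = 2,
               objective = "binary:logistic")
```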
\description{
@@ -73,5 +78,6 @@ test <- agaricus.test
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
pred <- predict(bst, test$data)
}

View File

@@ -1,3 +1,4 @@
// Copyright (c) 2014 by Contributors
#include <vector>
#include <string>
#include <utility>
@@ -31,14 +32,14 @@ extern "C" {
bool CheckNAN(double v) {
return ISNAN(v);
}
bool LogGamma(double v) {
double LogGamma(double v) {
return lgammafn(v);
}
} // namespace utils
} // namespace utils
namespace random {
void Seed(unsigned seed) {
warning("parameter seed is ignored, please set random seed using set.seed");
// warning("parameter seed is ignored, please set random seed using set.seed");
}
double Uniform(void) {
return unif_rand();
@@ -58,6 +59,10 @@ inline void _WrapperEnd(void) {
PutRNGstate();
}
// do nothing, check error
inline void CheckErr(int ret) {
}
extern "C" {
SEXP XGCheckNullPtr_R(SEXP handle) {
return ScalarLogical(R_ExternalPtrAddr(handle) == NULL);
@@ -69,7 +74,8 @@ extern "C" {
}
SEXP XGDMatrixCreateFromFile_R(SEXP fname, SEXP silent) {
_WrapperBegin();
void *handle = XGDMatrixCreateFromFile(CHAR(asChar(fname)), asInteger(silent));
DMatrixHandle handle;
CheckErr(XGDMatrixCreateFromFile(CHAR(asChar(fname)), asInteger(silent), &handle));
_WrapperEnd();
SEXP ret = PROTECT(R_MakeExternalPtr(handle, R_NilValue, R_NilValue));
R_RegisterCFinalizerEx(ret, _DMatrixFinalizer, TRUE);
@@ -90,7 +96,8 @@ extern "C" {
data[i * ncol +j] = din[i + nrow * j];
}
}
void *handle = XGDMatrixCreateFromMat(BeginPtr(data), nrow, ncol, asReal(missing));
DMatrixHandle handle;
CheckErr(XGDMatrixCreateFromMat(BeginPtr(data), nrow, ncol, asReal(missing), &handle));
_WrapperEnd();
SEXP ret = PROTECT(R_MakeExternalPtr(handle, R_NilValue, R_NilValue));
R_RegisterCFinalizerEx(ret, _DMatrixFinalizer, TRUE);
@@ -118,8 +125,10 @@ extern "C" {
indices_[i] = static_cast<unsigned>(p_indices[i]);
data_[i] = static_cast<float>(p_data[i]);
}
void *handle = XGDMatrixCreateFromCSC(BeginPtr(col_ptr_), BeginPtr(indices_),
BeginPtr(data_), nindptr, ndata);
DMatrixHandle handle;
CheckErr(XGDMatrixCreateFromCSC(BeginPtr(col_ptr_), BeginPtr(indices_),
BeginPtr(data_), nindptr, ndata,
&handle));
_WrapperEnd();
SEXP ret = PROTECT(R_MakeExternalPtr(handle, R_NilValue, R_NilValue));
R_RegisterCFinalizerEx(ret, _DMatrixFinalizer, TRUE);
@@ -133,7 +142,10 @@ extern "C" {
for (int i = 0; i < len; ++i) {
idxvec[i] = INTEGER(idxset)[i] - 1;
}
void *res = XGDMatrixSliceDMatrix(R_ExternalPtrAddr(handle), BeginPtr(idxvec), len);
DMatrixHandle res;
CheckErr(XGDMatrixSliceDMatrix(R_ExternalPtrAddr(handle),
BeginPtr(idxvec), len,
&res));
_WrapperEnd();
SEXP ret = PROTECT(R_MakeExternalPtr(res, R_NilValue, R_NilValue));
R_RegisterCFinalizerEx(ret, _DMatrixFinalizer, TRUE);
@@ -142,8 +154,8 @@ extern "C" {
}
void XGDMatrixSaveBinary_R(SEXP handle, SEXP fname, SEXP silent) {
_WrapperBegin();
XGDMatrixSaveBinary(R_ExternalPtrAddr(handle),
CHAR(asChar(fname)), asInteger(silent));
CheckErr(XGDMatrixSaveBinary(R_ExternalPtrAddr(handle),
CHAR(asChar(fname)), asInteger(silent)));
_WrapperEnd();
}
void XGDMatrixSetInfo_R(SEXP handle, SEXP field, SEXP array) {
@@ -156,24 +168,27 @@ extern "C" {
for (int i = 0; i < len; ++i) {
vec[i] = static_cast<unsigned>(INTEGER(array)[i]);
}
XGDMatrixSetGroup(R_ExternalPtrAddr(handle), BeginPtr(vec), len);
CheckErr(XGDMatrixSetGroup(R_ExternalPtrAddr(handle), BeginPtr(vec), len));
} else {
std::vector<float> vec(len);
#pragma omp parallel for schedule(static)
for (int i = 0; i < len; ++i) {
vec[i] = REAL(array)[i];
}
XGDMatrixSetFloatInfo(R_ExternalPtrAddr(handle),
CHAR(asChar(field)),
BeginPtr(vec), len);
CheckErr(XGDMatrixSetFloatInfo(R_ExternalPtrAddr(handle),
CHAR(asChar(field)),
BeginPtr(vec), len));
}
_WrapperEnd();
}
SEXP XGDMatrixGetInfo_R(SEXP handle, SEXP field) {
_WrapperBegin();
bst_ulong olen;
const float *res = XGDMatrixGetFloatInfo(R_ExternalPtrAddr(handle),
CHAR(asChar(field)), &olen);
const float *res;
CheckErr(XGDMatrixGetFloatInfo(R_ExternalPtrAddr(handle),
CHAR(asChar(field)),
&olen,
&res));
_WrapperEnd();
SEXP ret = PROTECT(allocVector(REALSXP, olen));
for (size_t i = 0; i < olen; ++i) {
@@ -183,23 +198,25 @@ extern "C" {
return ret;
}
SEXP XGDMatrixNumRow_R(SEXP handle) {
bst_ulong nrow = XGDMatrixNumRow(R_ExternalPtrAddr(handle));
bst_ulong nrow;
CheckErr(XGDMatrixNumRow(R_ExternalPtrAddr(handle), &nrow));
return ScalarInteger(static_cast<int>(nrow));
}
// functions related to booster
void _BoosterFinalizer(SEXP ext) {
if (R_ExternalPtrAddr(ext) == NULL) return;
XGBoosterFree(R_ExternalPtrAddr(ext));
CheckErr(XGBoosterFree(R_ExternalPtrAddr(ext)));
R_ClearExternalPtr(ext);
}
SEXP XGBoosterCreate_R(SEXP dmats) {
_WrapperBegin();
int len = length(dmats);
std::vector<void*> dvec;
for (int i = 0; i < len; ++i){
for (int i = 0; i < len; ++i) {
dvec.push_back(R_ExternalPtrAddr(VECTOR_ELT(dmats, i)));
}
void *handle = XGBoosterCreate(BeginPtr(dvec), dvec.size());
BoosterHandle handle;
CheckErr(XGBoosterCreate(BeginPtr(dvec), dvec.size(), &handle));
_WrapperEnd();
SEXP ret = PROTECT(R_MakeExternalPtr(handle, R_NilValue, R_NilValue));
R_RegisterCFinalizerEx(ret, _BoosterFinalizer, TRUE);
@@ -208,16 +225,16 @@ extern "C" {
}
void XGBoosterSetParam_R(SEXP handle, SEXP name, SEXP val) {
_WrapperBegin();
XGBoosterSetParam(R_ExternalPtrAddr(handle),
CHAR(asChar(name)),
CHAR(asChar(val)));
CheckErr(XGBoosterSetParam(R_ExternalPtrAddr(handle),
CHAR(asChar(name)),
CHAR(asChar(val))));
_WrapperEnd();
}
void XGBoosterUpdateOneIter_R(SEXP handle, SEXP iter, SEXP dtrain) {
_WrapperBegin();
XGBoosterUpdateOneIter(R_ExternalPtrAddr(handle),
asInteger(iter),
R_ExternalPtrAddr(dtrain));
CheckErr(XGBoosterUpdateOneIter(R_ExternalPtrAddr(handle),
asInteger(iter),
R_ExternalPtrAddr(dtrain)));
_WrapperEnd();
}
void XGBoosterBoostOneIter_R(SEXP handle, SEXP dtrain, SEXP grad, SEXP hess) {
@@ -230,9 +247,10 @@ extern "C" {
tgrad[j] = REAL(grad)[j];
thess[j] = REAL(hess)[j];
}
XGBoosterBoostOneIter(R_ExternalPtrAddr(handle),
R_ExternalPtrAddr(dtrain),
BeginPtr(tgrad), BeginPtr(thess), len);
CheckErr(XGBoosterBoostOneIter(R_ExternalPtrAddr(handle),
R_ExternalPtrAddr(dtrain),
BeginPtr(tgrad), BeginPtr(thess),
len));
_WrapperEnd();
}
SEXP XGBoosterEvalOneIter_R(SEXP handle, SEXP iter, SEXP dmats, SEXP evnames) {
@@ -249,21 +267,24 @@ extern "C" {
for (int i = 0; i < len; ++i) {
vec_sptr.push_back(vec_names[i].c_str());
}
const char *ret =
XGBoosterEvalOneIter(R_ExternalPtrAddr(handle),
asInteger(iter),
BeginPtr(vec_dmats), BeginPtr(vec_sptr), len);
const char *ret;
CheckErr(XGBoosterEvalOneIter(R_ExternalPtrAddr(handle),
asInteger(iter),
BeginPtr(vec_dmats),
BeginPtr(vec_sptr),
len, &ret));
_WrapperEnd();
return mkString(ret);
}
SEXP XGBoosterPredict_R(SEXP handle, SEXP dmat, SEXP option_mask, SEXP ntree_limit) {
_WrapperBegin();
bst_ulong olen;
const float *res = XGBoosterPredict(R_ExternalPtrAddr(handle),
R_ExternalPtrAddr(dmat),
asInteger(option_mask),
asInteger(ntree_limit),
&olen);
const float *res;
CheckErr(XGBoosterPredict(R_ExternalPtrAddr(handle),
R_ExternalPtrAddr(dmat),
asInteger(option_mask),
asInteger(ntree_limit),
&olen, &res));
_WrapperEnd();
SEXP ret = PROTECT(allocVector(REALSXP, olen));
for (size_t i = 0; i < olen; ++i) {
@@ -274,12 +295,12 @@ extern "C" {
}
void XGBoosterLoadModel_R(SEXP handle, SEXP fname) {
_WrapperBegin();
XGBoosterLoadModel(R_ExternalPtrAddr(handle), CHAR(asChar(fname)));
CheckErr(XGBoosterLoadModel(R_ExternalPtrAddr(handle), CHAR(asChar(fname))));
_WrapperEnd();
}
void XGBoosterSaveModel_R(SEXP handle, SEXP fname) {
_WrapperBegin();
XGBoosterSaveModel(R_ExternalPtrAddr(handle), CHAR(asChar(fname)));
CheckErr(XGBoosterSaveModel(R_ExternalPtrAddr(handle), CHAR(asChar(fname))));
_WrapperEnd();
}
void XGBoosterLoadModelFromRaw_R(SEXP handle, SEXP raw) {
@@ -292,7 +313,8 @@ extern "C" {
SEXP XGBoosterModelToRaw_R(SEXP handle) {
bst_ulong olen;
_WrapperBegin();
const char *raw = XGBoosterGetModelRaw(R_ExternalPtrAddr(handle), &olen);
const char *raw;
CheckErr(XGBoosterGetModelRaw(R_ExternalPtrAddr(handle), &olen, &raw));
_WrapperEnd();
SEXP ret = PROTECT(allocVector(RAWSXP, olen));
if (olen != 0) {
@@ -304,16 +326,16 @@ extern "C" {
SEXP XGBoosterDumpModel_R(SEXP handle, SEXP fmap, SEXP with_stats) {
_WrapperBegin();
bst_ulong olen;
const char **res =
XGBoosterDumpModel(R_ExternalPtrAddr(handle),
CHAR(asChar(fmap)),
asInteger(with_stats),
&olen);
const char **res;
CheckErr(XGBoosterDumpModel(R_ExternalPtrAddr(handle),
CHAR(asChar(fmap)),
asInteger(with_stats),
&olen, &res));
_WrapperEnd();
SEXP out = PROTECT(allocVector(STRSXP, olen));
for (size_t i = 0; i < olen; ++i) {
stringstream stream;
stream << "booster["<<i<<"]\n" << res[i];
stream << "booster[" << i <<"]\n" << res[i];
SET_STRING_ELT(out, i, mkChar(stream.str().c_str()));
}
UNPROTECT(1);

View File

@@ -1,10 +1,12 @@
#ifndef XGBOOST_WRAPPER_R_H_
#define XGBOOST_WRAPPER_R_H_
/*!
* Copyright 2014 (c) by Contributors
* \file xgboost_wrapper_R.h
* \author Tianqi Chen
* \brief R wrapper of xgboost
*/
#ifndef XGBOOST_WRAPPER_R_H_ // NOLINT(*)
#define XGBOOST_WRAPPER_R_H_ // NOLINT(*)
extern "C" {
#include <Rinternals.h>
#include <R_ext/Random.h>
@@ -153,4 +155,4 @@ extern "C" {
*/
SEXP XGBoosterDumpModel_R(SEXP handle, SEXP fmap, SEXP with_stats);
}
#endif // XGBOOST_WRAPPER_R_H_
#endif // XGBOOST_WRAPPER_R_H_ // NOLINT(*)

View File

@@ -1,3 +1,4 @@
// Copyright (c) 2014 by Contributors
#include <stdio.h>
#include <stdarg.h>
#include <Rinternals.h>

View File

@@ -0,0 +1,4 @@
library(testthat)
library(xgboost)
test_check("xgboost")

View File

@@ -0,0 +1,36 @@
require(xgboost)
context("basic functions")
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
set.seed(1994)
test_that("train and predict", {
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
pred <- predict(bst, test$data)
expect_equal(length(pred), 1611)
})
test_that("early stopping", {
res <- xgb.cv(data = train$data, label = train$label, max.depth = 2, nfold = 5,
eta = 0.3, nthread = 2, nround = 20, objective = "binary:logistic",
early.stop.round = 3, maximize = FALSE)
expect_true(nrow(res) < 20)
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
eta = 0.3, nthread = 2, nround = 20, objective = "binary:logistic",
early.stop.round = 3, maximize = FALSE)
pred <- predict(bst, test$data)
expect_equal(length(pred), 1611)
})
test_that("save_period", {
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
eta = 0.3, nthread = 2, nround = 20, objective = "binary:logistic",
save_period = 10, save_name = "xgb.model")
pred <- predict(bst, test$data)
expect_equal(length(pred), 1611)
})

View File

@@ -0,0 +1,48 @@
context('Test models with custom objective')
require(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
test_that("custom objective works", {
watchlist <- list(eval = dtest, train = dtrain)
num_round <- 2
logregobj <- function(preds, dtrain) {
labels <- getinfo(dtrain, "label")
preds <- 1 / (1 + exp(-preds))
grad <- preds - labels
hess <- preds * (1 - preds)
return(list(grad = grad, hess = hess))
}
evalerror <- function(preds, dtrain) {
labels <- getinfo(dtrain, "label")
err <- as.numeric(sum(labels != (preds > 0))) / length(labels)
return(list(metric = "error", value = err))
}
param <- list(max.depth=2, eta=1, nthread = 2, silent=1,
objective=logregobj, eval_metric=evalerror)
bst <- xgb.train(param, dtrain, num_round, watchlist)
expect_equal(class(bst), "xgb.Booster")
expect_equal(length(bst$raw), 1064)
attr(dtrain, 'label') <- getinfo(dtrain, 'label')
logregobjattr <- function(preds, dtrain) {
labels <- attr(dtrain, 'label')
preds <- 1 / (1 + exp(-preds))
grad <- preds - labels
hess <- preds * (1 - preds)
return(list(grad = grad, hess = hess))
}
param <- list(max.depth=2, eta=1, nthread = 2, silent = 1,
objective = logregobjattr, eval_metric = evalerror)
bst <- xgb.train(param, dtrain, num_round, watchlist)
expect_equal(class(bst), "xgb.Booster")
expect_equal(length(bst$raw), 1064)
})

View File

@@ -0,0 +1,19 @@
context('Test generalized linear models')
require(xgboost)
test_that("glm works", {
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
expect_equal(class(dtrain), "xgb.DMatrix")
expect_equal(class(dtest), "xgb.DMatrix")
param <- list(objective = "binary:logistic", booster = "gblinear",
nthread = 2, alpha = 0.0001, lambda = 1)
watchlist <- list(eval = dtest, train = dtrain)
num_round <- 2
bst <- xgb.train(param, dtrain, num_round, watchlist)
ypred <- predict(bst, dtest)
expect_equal(length(getinfo(dtest, 'label')), 1611)
})

View File

@@ -0,0 +1,68 @@
context('Test helper functions')
require(xgboost)
require(data.table)
require(Matrix)
require(vcd)
set.seed(1982)
data(Arthritis)
data(agaricus.train, package='xgboost')
df <- data.table(Arthritis, keep.rownames = F)
df[,AgeDiscret := as.factor(round(Age / 10,0))]
df[,AgeCat := as.factor(ifelse(Age > 30, "Old", "Young"))]
df[,ID := NULL]
sparse_matrix <- sparse.model.matrix(Improved~.-1, data = df)
output_vector <- df[,Y := 0][Improved == "Marked",Y := 1][,Y]
bst.Tree <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 9,
eta = 1, nthread = 2, nround = 10, objective = "binary:logistic", booster = "gbtree")
bst.GLM <- xgboost(data = sparse_matrix, label = output_vector,
eta = 1, nthread = 2, nround = 10, objective = "binary:logistic", booster = "gblinear")
feature.names <- agaricus.train$data@Dimnames[[2]]
test_that("xgb.dump works", {
capture.output(print(xgb.dump(bst.Tree)))
capture.output(print(xgb.dump(bst.GLM)))
expect_true(xgb.dump(bst.Tree, 'xgb.model.dump', with.stats = T))
})
test_that("xgb.model.dt.tree works with and without feature names", {
names.dt.trees <- c("ID", "Feature", "Split", "Yes", "No", "Missing", "Quality", "Cover",
"Tree", "Yes.Feature", "Yes.Cover", "Yes.Quality", "No.Feature", "No.Cover", "No.Quality")
dt.tree <- xgb.model.dt.tree(feature_names = feature.names, model = bst.Tree)
expect_equal(names.dt.trees, names(dt.tree))
expect_equal(dim(dt.tree), c(162, 15))
xgb.model.dt.tree(model = bst.Tree)
})
test_that("xgb.importance works with and without feature names", {
importance.Tree <- xgb.importance(feature_names = sparse_matrix@Dimnames[[2]], model = bst.Tree)
expect_equal(dim(importance.Tree), c(7, 4))
expect_equal(colnames(importance.Tree), c("Feature", "Gain", "Cover", "Frequency"))
xgb.importance(model = bst.Tree)
xgb.plot.importance(importance_matrix = importance.Tree)
})
test_that("xgb.importance works with GLM model", {
importance.GLM <- xgb.importance(feature_names = sparse_matrix@Dimnames[[2]], model = bst.GLM)
expect_equal(dim(importance.GLM), c(10, 2))
expect_equal(colnames(importance.GLM), c("Feature", "Weight"))
xgb.importance(model = bst.GLM)
xgb.plot.importance(importance.GLM)
})
test_that("xgb.plot.tree works with and without feature names", {
xgb.plot.tree(feature_names = feature.names, model = bst.Tree)
xgb.plot.tree(model = bst.Tree)
})
test_that("xgb.plot.multi.trees works with and without feature names", {
xgb.plot.multi.trees(model = bst.Tree, feature_names = feature.names, features.keep = 3)
xgb.plot.multi.trees(model = bst.Tree, features.keep = 3)
})
test_that("xgb.plot.deepness works", {
xgb.plot.deepness(model = bst.Tree)
})

View File

@@ -0,0 +1,27 @@
context("Code is of high quality and lint free")
test_that("Code Lint", {
skip_on_cran()
skip_on_travis()
skip_if_not_installed("lintr")
my_linters <- list(
absolute_paths_linter=lintr::absolute_paths_linter,
assignment_linter=lintr::assignment_linter,
closed_curly_linter=lintr::closed_curly_linter,
commas_linter=lintr::commas_linter,
# commented_code_linter=lintr::commented_code_linter,
infix_spaces_linter=lintr::infix_spaces_linter,
line_length_linter=lintr::line_length_linter,
no_tab_linter=lintr::no_tab_linter,
object_usage_linter=lintr::object_usage_linter,
# snake_case_linter=lintr::snake_case_linter,
# multiple_dots_linter=lintr::multiple_dots_linter,
object_length_linter=lintr::object_length_linter,
open_curly_linter=lintr::open_curly_linter,
# single_quotes_linter=lintr::single_quotes_linter,
spaces_inside_linter=lintr::spaces_inside_linter,
spaces_left_parentheses_linter=lintr::spaces_left_parentheses_linter,
trailing_blank_lines_linter=lintr::trailing_blank_lines_linter,
trailing_whitespace_linter=lintr::trailing_whitespace_linter
)
# lintr::expect_lint_free(linters=my_linters) # uncomment this if you want to check code quality
})

View File

@@ -0,0 +1,32 @@
context('Test model params and call are exposed to R')
require(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
bst <- xgboost(data = dtrain,
max.depth = 2,
eta = 1,
nround = 10,
nthread = 1,
verbose = 0,
objective = "binary:logistic")
test_that("call is exposed to R", {
model_call <- attr(bst, "call")
expect_is(model_call, "call")
})
test_that("params is exposed to R", {
model_params <- attr(bst, "params")
expect_is(model_params, "list")
expect_equal(model_params$eta, 1)
expect_equal(model_params$max.depth, 2)
expect_equal(model_params$objective, "binary:logistic")
})

View File

@@ -0,0 +1,14 @@
context('Test poisson regression model')
require(xgboost)
set.seed(1994)
test_that("poisson regression works", {
data(mtcars)
bst <- xgboost(data = as.matrix(mtcars[,-11]),label = mtcars[,11],
objective = 'count:poisson', nrounds=5)
expect_equal(class(bst), "xgb.Booster")
pred <- predict(bst,as.matrix(mtcars[, -11]))
expect_equal(length(pred), 32)
expect_equal(sqrt(mean( (pred - mtcars[,11]) ^ 2)), 1.16, tolerance = 0.01)
})

View File

@@ -190,7 +190,7 @@ Measure feature importance
In the code below, `sparse_matrix@Dimnames[[2]]` represents the column names of the sparse matrix. These names are the original values of the features (remember, each binary column == one value of one *categorical* feature).
```{r}
importance <- xgb.importance(sparse_matrix@Dimnames[[2]], model = bst)
importance <- xgb.importance(feature_names = sparse_matrix@Dimnames[[2]], model = bst)
head(importance)
```
@@ -202,7 +202,7 @@ head(importance)
`Cover` measures the relative number of observations related to a feature.
`Frequence` is a simpler way to measure the `Gain`. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it).
`Frequency` is a simpler way to measure the `Gain`. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it).
### Improvement in the interpretability of feature importance data.table
@@ -213,10 +213,10 @@ One simple solution is to count the co-occurrences of a feature and a class of t
For that purpose we will execute the same function as above but using two more parameters, `data` and `label`.
```{r}
importanceRaw <- xgb.importance(sparse_matrix@Dimnames[[2]], model = bst, data = sparse_matrix, label = output_vector)
importanceRaw <- xgb.importance(feature_names = sparse_matrix@Dimnames[[2]], model = bst, data = sparse_matrix, label = output_vector)
# Cleaning for better display
importanceClean <- importanceRaw[,`:=`(Cover=NULL, Frequence=NULL)]
importanceClean <- importanceRaw[,`:=`(Cover=NULL, Frequency=NULL)]
head(importanceClean)
```

View File

@@ -57,16 +57,14 @@ devtools::install_github('dmlc/xgboost', subdir='R-package')
Cran version
------------
For stable version on *CRAN*, run:
As of 2015-03-13, xgboost was removed from the CRAN repository.
```{r installCran, eval=FALSE}
install.packages('xgboost')
```
Formerly available versions can be obtained from the CRAN [archive](http://cran.r-project.org/src/contrib/Archive/xgboost)
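A hedged sketch of installing an archived release from source; the version number in the URL is a hypothetical example and should be replaced by one actually present in the archive:

```{r installArchive, eval=FALSE}
# Hypothetical example: install an archived source release
# (the version number below is illustrative; pick one that exists in the archive)
url <- "http://cran.r-project.org/src/contrib/Archive/xgboost/xgboost_0.4-2.tar.gz"
download.file(url, destfile = "xgboost_0.4-2.tar.gz")
install.packages("xgboost_0.4-2.tar.gz", repos = NULL, type = "source")
```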
Learning
========
For the purpose of this tutorial we will load **Xgboost** package.
For the purpose of this tutorial we will load **XGBoost** package.
```{r libLoading, results='hold', message=F, warning=F}
require(xgboost)
@@ -117,7 +115,7 @@ dim(train$data)
dim(test$data)
```
This dataset is very small to not make the **R** package too heavy, however **Xgboost** is built to manage huge dataset very efficiently.
This dataset is very small to not make the **R** package too heavy, however **XGBoost** is built to manage huge dataset very efficiently.
As seen below, the `data` are stored in a `dgCMatrix` which is a *sparse* matrix and `label` vector is a `numeric` vector (`{0,1}`):
@@ -126,7 +124,7 @@ class(train$data)[1]
class(train$label)
```
Basic Training using Xgboost
Basic Training using XGBoost
----------------------------
This step is the most critical part of the process for the quality of our model.
@@ -162,7 +160,7 @@ bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max.depth
#### xgb.DMatrix
**Xgboost** offers a way to group them in a `xgb.DMatrix`. You can even add other meta data in it. It will be usefull for the most advanced features we will discover later.
**XGBoost** offers a way to group them in a `xgb.DMatrix`. You can even add other meta data in it. It will be useful for the most advanced features we will discover later.
```{r trainingDmatrix, message=F, warning=F}
dtrain <- xgb.DMatrix(data = train$data, label = train$label)
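# A hedged illustration (not part of the original vignette): other metadata,
# such as per-row weights, could be attached to the xgb.DMatrix with setinfo();
# the uniform weights below are an assumed example.
setinfo(dtrain, "weight", rep(1, length(train$label)))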
@@ -171,7 +169,7 @@ bstDMatrix <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround
#### Verbose option
**Xgboost** has severa features to help you to view how the learning progress internally. The purpose is to help you to set the best parameters, which is the key of your model quality.
**XGBoost** has several features to help you to view how the learning progress internally. The purpose is to help you to set the best parameters, which is the key of your model quality.
One of the simplest ways to see the training progress is to set the `verbose` option (see below for more advanced techniques).
@@ -190,13 +188,13 @@ bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, o
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 2)
```
Basic prediction using Xgboost
Basic prediction using XGBoost
==============================
Perform the prediction
----------------------
The pupose of the model we have built is to classify new data. As explained before, we will use the `test` dataset for this step.
The purpose of the model we have built is to classify new data. As explained before, we will use the `test` dataset for this step.
```{r predicting, message=F, warning=F}
pred <- predict(bst, test$data)
@@ -213,7 +211,7 @@ These numbers doesn't look like *binary classification* `{0,1}`. We need to perf
Transform the regression in a binary classification
---------------------------------------------------
The only thing that **Xgboost** does is a *regression*. **Xgboost** is using `label` vector to build its *regression* model.
The only thing that **XGBoost** does is a *regression*. **XGBoost** is using `label` vector to build its *regression* model.
How can we use a *regression* model to perform a binary classification?
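One minimal sketch, assuming `pred` holds the probabilities predicted above, is to threshold them at 0.5:

```{r}
# regard probabilities above 0.5 as class 1 and the rest as class 0
prediction <- as.numeric(pred > 0.5)
head(prediction)
```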
@@ -269,9 +267,9 @@ Measure learning progress with xgb.train
Both `xgboost` (simple) and `xgb.train` (advanced) functions train models.
One of the special feature of `xgb.train` is the capacity to follow the progress of the learning after each round. Because of the way boosting works, there is a time when having too many rounds lead to an overfitting. You can see this feature as a cousin of cross-validation method. The following technics will help you to avoid overfitting or optimizing the learning time in stopping it as soon as possible.
One of the special feature of `xgb.train` is the capacity to follow the progress of the learning after each round. Because of the way boosting works, there is a time when having too many rounds lead to an overfitting. You can see this feature as a cousin of cross-validation method. The following techniques will help you to avoid overfitting or optimizing the learning time in stopping it as soon as possible.
One way to measure progress in learning of a model is to provide to **Xgboost** a second dataset already classified. Therefore it can learn on the first dataset and test its model on the second one. Some metrics are measured after each round during the learning.
One way to measure progress in learning of a model is to provide to **XGBoost** a second dataset already classified. Therefore it can learn on the first dataset and test its model on the second one. Some metrics are measured after each round during the learning.
> In some way this is similar to what we did above with the average error. The main difference is that there the error was measured after the model was built, whereas here it is measured during construction.
@@ -283,11 +281,11 @@ watchlist <- list(train=dtrain, test=dtest)
bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nthread = 2, nround=2, watchlist=watchlist, objective = "binary:logistic")
```
**Xgboost** has computed at each round the same average error metric than seen above (we set `nround` to 2, that is why we have two lines). Obviously, the `train-error` number is related to the training dataset (the one the algorithm learns from) and the `test-error` number to the test dataset.
**XGBoost** has computed at each round the same average error metric than seen above (we set `nround` to 2, that is why we have two lines). Obviously, the `train-error` number is related to the training dataset (the one the algorithm learns from) and the `test-error` number to the test dataset.
Both training and test error related metrics are very similar, and in some way, it makes sense: what we have learned from the training dataset matches the observations from the test dataset.
If with your own dataset you have not such results, you should think about how you did to divide your dataset in training and test. May be there is something to fix. Again, `caret` package may [help](http://topepo.github.io/caret/splitting.html).
If with your own dataset you have not such results, you should think about how you divided your dataset in training and test. May be there is something to fix. Again, `caret` package may [help](http://topepo.github.io/caret/splitting.html).
For a better understanding of the learning progression, you may want to have some specific metric or even use multiple evaluation metrics.
@@ -300,7 +298,7 @@ bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nthread = 2, nround=2, watchli
Linear boosting
---------------
Until know, all the learnings we have performed were based on boosting trees. **Xgboost** implements a second algorithm, based on linear boosting. The only difference with previous command is `booster = "gblinear"` parameter (and removing `eta` parameter).
Until now, all the learnings we have performed were based on boosting trees. **XGBoost** implements a second algorithm, based on linear boosting. The only difference with previous command is `booster = "gblinear"` parameter (and removing `eta` parameter).
```{r linearBoosting, message=F, warning=F}
bst <- xgb.train(data=dtrain, booster = "gblinear", max.depth=2, nthread = 2, nround=2, watchlist=watchlist, eval.metric = "error", eval.metric = "logloss", objective = "binary:logistic")
@@ -308,7 +306,7 @@ bst <- xgb.train(data=dtrain, booster = "gblinear", max.depth=2, nthread = 2, nr
In this specific case, *linear boosting* gets slightly better performance metrics than the decision-tree-based algorithm.
In simple cases, it will happem because there is nothing better than a linear algorithm to catch a linear link. However, decision trees are much better to catch a non linear link between predictors and outcome. Because there is no silver bullet, we advise you to check both algorithms with your own datasets to have an idea of what to use.
In simple cases, it will happen because there is nothing better than a linear algorithm to catch a linear link. However, decision trees are much better to catch a non linear link between predictors and outcome. Because there is no silver bullet, we advise you to check both algorithms with your own datasets to have an idea of what to use.
Manipulating xgb.DMatrix
------------------------
@@ -339,6 +337,17 @@ err <- as.numeric(sum(as.integer(pred > 0.5) != label))/length(label)
print(paste("test-error=", err))
```
View feature importance/influence from the learnt model
-------------------------------------------------------
Feature importance is similar to R gbm package's relative influence (rel.inf).
```
importance_matrix <- xgb.importance(model = bst)
print(importance_matrix)
xgb.plot.importance(importance_matrix = importance_matrix)
```
View the trees from a model
---------------------------
@@ -348,14 +357,20 @@ You can dump the tree you learned using `xgb.dump` into a text file.
xgb.dump(bst, with.stats = T)
```
You can plot the trees from your model using `xgb.plot.tree`
```
xgb.plot.tree(model = bst)
```
> if you provide a path to `fname` parameter you can save the trees to your hard drive.
Save and load models
--------------------
May be your dataset is big, and it takes time to train a model on it? May be you are not a big fan of loosing time in redoing the same task again and again? In these very rare cases, you will want to save your model and load it when required.
Maybe your dataset is big, and it takes time to train a model on it? May be you are not a big fan of losing time in redoing the same task again and again? In these very rare cases, you will want to save your model and load it when required.
Hopefully for you, **Xgboost** implements such functions.
Hopefully for you, **XGBoost** implements such functions.
```{r saveModel, message=F, warning=F}
# save model to binary local file
@@ -364,7 +379,7 @@ xgb.save(bst, "xgboost.model")
> `xgb.save` function should return `r TRUE` if everything goes well and crashes otherwise.
An interesting test to see how identic is our saved model with the original one would be to compare the two predictions.
An interesting test to see how identical our saved model is to the original one would be to compare the two predictions.
```{r loadModel, message=F, warning=F}
# load binary model to R
@@ -382,7 +397,7 @@ file.remove("./xgboost.model")
> result is `0`? We are good!
In some very specific cases, like when you want to pilot **Xgboost** from `caret` package, you will want to save the model as a *R* binary vector. See below how to do it.
In some very specific cases, like when you want to pilot **XGBoost** from `caret` package, you will want to save the model as a *R* binary vector. See below how to do it.
```{r saveLoadRBinVectorModel, message=F, warning=F}
# save model to R's raw vector
@@ -399,7 +414,7 @@ pred3 <- predict(bst3, test$data)
print(paste("sum(abs(pred3-pred))=", sum(abs(pred2-pred))))
```
> Again `0`? It seems that `Xgboost` works pretty well!
> Again `0`? It seems that `XGBoost` works pretty well!
References
==========

README.md
View File

@@ -1,57 +1,84 @@
XGBoost: eXtreme Gradient Boosting
==================================
<img src=https://raw.githubusercontent.com/dmlc/dmlc.github.io/master/img/logo-m/xgboost.png width=135/> eXtreme Gradient Boosting
===========
[![Build Status](https://travis-ci.org/dmlc/xgboost.svg?branch=master)](https://travis-ci.org/dmlc/xgboost)
[![Documentation Status](https://readthedocs.org/projects/xgboost/badge/?version=latest)](https://xgboost.readthedocs.org)
[![GitHub license](http://dmlc.github.io/img/apache2.svg)](./LICENSE)
[![CRAN Status Badge](http://www.r-pkg.org/badges/version/xgboost)](http://cran.r-project.org/web/packages/xgboost)
[![PyPI version](https://badge.fury.io/py/xgboost.svg)](https://pypi.python.org/pypi/xgboost/)
[![Gitter chat for developers at https://gitter.im/dmlc/xgboost](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/dmlc/xgboost?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
An optimized general purpose gradient boosting library. The library is parallelized, and also provides an optimized distributed version.
It implements machine learning algorithm under gradient boosting framework, including generalized linear model and gradient boosted regression tree (GBDT). XGBoost can also also distributed and scale to Terascale data
Contributors: https://github.com/dmlc/xgboost/graphs/contributors
It implements machine learning algorithms under the [Gradient Boosting](https://en.wikipedia.org/wiki/Gradient_boosting) framework, including [Generalized Linear Model](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLM) and [Gradient Boosted Decision Trees](https://en.wikipedia.org/wiki/Gradient_boosting#Gradient_tree_boosting) (GBDT). XGBoost can also be [distributed](#features) and scale to Terascale data
Documentations: [Documentation of xgboost](doc/README.md)
XGBoost is part of [Distributed Machine Learning Common](http://dmlc.github.io/) <img src=https://avatars2.githubusercontent.com/u/11508361?v=3&s=20> projects
Issues Tracker: [https://github.com/dmlc/xgboost/issues](https://github.com/dmlc/xgboost/issues?q=is%3Aissue+label%3Aquestion)
Please join [XGBoost User Group](https://groups.google.com/forum/#!forum/xgboost-user/) to ask questions and share your experience on xgboost.
- Use issue tracker for bug reports, feature requests etc.
- Use the user group to post your experience, ask questions about general usages.
Gitter for developers [![Gitter chat for developers at https://gitter.im/dmlc/xgboost](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/dmlc/xgboost?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
Distributed Version: [Distributed XGBoost](multi-node)
Highlights of Usecases: [Highlight Links](doc/README.md#highlight-links)
Contents
--------
* [What's New](#whats-new)
* [Version](#version)
* [Documentation](doc/index.md)
* [Build Instruction](doc/build.md)
* [Features](#features)
* [Distributed XGBoost](multi-node)
* [Usecases](doc/index.md#highlight-links)
* [Bug Reporting](#bug-reporting)
* [Contributing to XGBoost](#contributing-to-xgboost)
* [Committers and Contributors](CONTRIBUTORS.md)
* [License](#license)
* [XGBoost in Graphlab Create](#xgboost-in-graphlab-create)
What's New
==========
----------
* XGBoost helps Vlad Mironov, Alexander Guschin to win the [CERN LHCb experiment Flavour of Physics competition](https://www.kaggle.com/c/flavours-of-physics). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/11/30/flavour-of-physics-technical-write-up-1st-place-go-polar-bears/).
* XGBoost helps Mario Filho, Josef Feigl, Lucas, Gilberto to win the [Caterpillar Tube Pricing competition](https://www.kaggle.com/c/caterpillar-tube-pricing). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/09/22/caterpillar-winners-interview-1st-place-gilberto-josef-leustagos-mario/).
* XGBoost helps Halla Yang to win the [Recruit Coupon Purchase Prediction Challenge](https://www.kaggle.com/c/coupon-purchase-prediction). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/10/21/recruit-coupon-purchase-winners-interview-2nd-place-halla-yang/).
* XGBoost helps Owen Zhang to win the [Avito Context Ad Click competition](https://www.kaggle.com/c/avito-context-ad-clicks). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/08/26/avito-winners-interview-1st-place-owen-zhang/).
* XGBoost helps Chenglong Chen to win [Kaggle CrowdFlower Competition](https://www.kaggle.com/c/crowdflower-search-relevance)
Check out the [winning solution](https://github.com/ChenglongChen/Kaggle_CrowdFlower)
* XGBoost-0.4 release, see [CHANGES.md](CHANGES.md#xgboost-04)
* XGBoost wins [WWW2015 Microsoft Malware Classification Challenge (BIG 2015)](http://www.kaggle.com/c/malware-classification/forums/t/13490/say-no-to-overfitting-approaches-sharing)
- Checkout the winning solution at [Highlight links](doc/README.md#highlight-links)
* XGBoost helps three champion teams to win [WWW2015 Microsoft Malware Classification Challenge (BIG 2015)](http://www.kaggle.com/c/malware-classification/forums/t/13490/say-no-to-overfitting-approaches-sharing)
Check out the [winning solution](doc/README.md#highlight-links)
* [External Memory Version](doc/external_memory.md)
Features
========
* Easily accessible in python, R, Julia, CLI
* Fast speed and memory efficient
- Can be more than 10 times faster than GBM in sklearn and R
- Handles sparse matrices, support external memory
* Accurate prediction, and used extensively by data scientists and kagglers
- See [highlight links](https://github.com/dmlc/xgboost/blob/master/doc/README.md#highlight-links)
* Distributed and Portable
- The distributed version runs on Hadoop (YARN), MPI, SGE etc.
- Scales to billions of examples and beyond
Build
=======
* Run ```bash build.sh``` (you can also type make)
- Normally it gives what you want
- See [Build Instruction](doc/build.md) for more information
Version
=======
* Current version xgboost-0.4, a lot improvment has been made since 0.3
- Change log in [CHANGES.md](CHANGES.md)
-------
* Current version xgboost-0.4
- [Change log](CHANGES.md)
- This version is compatible with 0.3x versions
Features
--------
* Easily accessible through CLI, [python](https://github.com/dmlc/xgboost/blob/master/demo/guide-python/basic_walkthrough.py),
[R](https://github.com/dmlc/xgboost/blob/master/R-package/demo/basic_walkthrough.R),
[Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/basic_walkthrough.jl)
* Its fast! Benchmark numbers comparing xgboost, H20, Spark, R - [benchm-ml numbers](https://github.com/szilard/benchm-ml)
* Memory efficient - Handles sparse matrices, supports external memory
* Accurate prediction, and used extensively by data scientists and kagglers - [highlight links](https://github.com/dmlc/xgboost/blob/master/doc/README.md#highlight-links)
* Distributed version runs on Hadoop (YARN), MPI, SGE etc., scales to billions of examples.
Bug Reporting
-------------
* For reporting bugs please use the [xgboost/issues](https://github.com/dmlc/xgboost/issues) page.
* For generic questions or to share your experience using xgboost please use the [XGBoost User Group](https://groups.google.com/forum/#!forum/xgboost-user/)
Contributing to XGBoost
-----------------------
XGBoost has been developed and used by a group of active community members. Everyone is more than welcome to contribute. It is a way to make the project better and more accessible to more users.
* Check out [Feature Wish List](https://github.com/dmlc/xgboost/labels/Wish-List) to see what can be improved, or open an issue if you want something.
* Contribute to the [documents and examples](https://github.com/dmlc/xgboost/blob/master/doc/) to share your experience with other users.
* Please add your name to [CONTRIBUTORS.md](CONTRIBUTORS.md) after your patch has been merged.
License
-------
© Contributors, 2015. Licensed under an [Apache-2](https://github.com/dmlc/xgboost/blob/master/LICENSE) license.
XGBoost in Graphlab Create
==========================
* XGBoost is adopted as part of boosted tree toolkit in Graphlab Create (GLC). Graphlab Create is a powerful python toolkit that allows you to data manipulation, graph processing, hyper-parameter search, and visualization of TeraBytes scale data in one framework. Try the Graphlab Create in http://graphlab.com/products/create/quick-start-guide.html
* Nice blogpost by Jay Gu using GLC boosted tree to solve kaggle bike sharing challenge: http://blog.graphlab.com/using-gradient-boosted-trees-to-predict-bike-sharing-demand
--------------------------
* XGBoost is adopted as part of boosted tree toolkit in Graphlab Create (GLC). Graphlab Create is a powerful python toolkit that allows you to do data manipulation, graph processing, hyper-parameter search, and visualization of TeraBytes scale data in one framework. Try the [Graphlab Create](http://graphlab.com/products/create/quick-start-guide.html)
* Nice [blogpost](http://blog.graphlab.com/using-gradient-boosted-trees-to-predict-bike-sharing-demand) by Jay Gu about using GLC boosted tree to solve kaggle bike sharing challenge:

appveyor.yml Normal file
View File

@@ -0,0 +1,36 @@
environment:
global:
CMD_IN_ENV: "cmd /E:ON /V:ON /C .\\python-appveyor-demo\\appveyor\\run_with_env.cmd"
DISABLE_OPENMP: 1
VisualStudioVersion: 12.0
matrix:
- PYTHON: "C:\\Python27-x64"
PYTHON_VERSION: "2.7.x" # currently 2.7.9
PYTHON_ARCH: "64"
- PYTHON: "C:\\Python33-x64"
PYTHON_VERSION: "3.3.x" # currently 3.3.5
PYTHON_ARCH: "64"
platform:
- x64
configuration:
- Release
install:
- cmd: git clone https://github.com/ogrisel/python-appveyor-demo
- ECHO "Filesystem root:"
- ps: "ls \"C:/\""
- ECHO "Installed SDKs:"
- ps: "ls \"C:/Program Files/Microsoft SDKs/Windows\""
- ps: python-appveyor-demo\appveyor\install.ps1
- "SET PATH=%PYTHON%;%PYTHON%\\Scripts;%PATH%"
- "python --version"
- "python -c \"import struct; print(struct.calcsize('P') * 8)\""
build: off
#project: windows\xgboost.sln

View File

@@ -6,6 +6,18 @@
# See additional instruction in doc/build.md
#for building static OpenMP lib in MAC for easier installation in MAC
#doesn't work with XCode clang/LLVM since Apple doesn't support,
#needs brew install gcc 4.9+ with OpenMP. By default the static link is OFF
static_omp=0
if ((${static_omp}==1)); then
rm libgomp.a
ln -s `g++ -print-file-name=libgomp.a`
make clean
make omp_mac_static=1
echo "Successfully build multi-thread static link xgboost"
exit 0
fi
if make; then
echo "Successfully build multi-thread xgboost"

demo/.gitignore vendored
View File

@@ -1 +1,2 @@
*.libsvm
*.pkl

View File

@@ -1,14 +1,14 @@
XGBoost Examples
====
XGBoost Code Examples
=====================
This folder contains all the code examples using xgboost.
* Contribution of examples, benchmarks is more than welcome!
* If you like to share how you use xgboost to solve your problem, send a pull request:)
Features Walkthrough
====
This is a list of short codes introducing different functionalities of xgboost and its wrapper.
* Basic walkthrough of wrappers
--------------------
This is a list of short codes introducing different functionalities of xgboost packages.
* Basic walkthrough of packages
[python](guide-python/basic_walkthrough.py)
[R](../R-package/demo/basic_walkthrough.R)
[Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/basic_walkthrough.jl)
@@ -22,8 +22,8 @@ This is a list of short codes introducing different functionalities of xgboost a
[Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/boost_from_prediction.jl)
* Predicting using first n trees
[python](guide-python/predict_first_ntree.py)
[R](../R-package/demo/boost_from_prediction.R)
[Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/boost_from_prediction.jl)
[R](../R-package/demo/predict_first_ntree.R)
[Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/predict_first_ntree.jl)
* Generalized Linear Model
[python](guide-python/generalized_linear_model.py)
[R](../R-package/demo/generalized_linear_model.R)
@@ -37,7 +37,7 @@ This is a list of short codes introducing different functionalities of xgboost a
[R](../R-package/demo/predict_leaf_indices.R)
Basic Examples by Tasks
====
-----------------------
Most of the examples in this section are based on the CLI or python version.
However, the parameter settings can be applied to all versions
* [Binary classification](binary_classification)
@@ -46,7 +46,6 @@ However, the parameter settings can be applied to all versions
* [Learning to Rank](rank)
Benchmarks
====
----------
* [Starter script for Kaggle Higgs Boson](kaggle-higgs)
* [Kaggle Tradeshift winning solution by daxiongshu](https://github.com/daxiongshu/kaggle-tradeshift-winning-solution)

View File

@@ -147,7 +147,7 @@ Run the command again, we can find the log file becomes
```
The rule is eval[name-printed-in-log] = filename; the file will then be added to the monitoring process and evaluated after each round.
xgboost also support monitoring multiple metrics, suppose we also want to monitor average log-likelihood of each prediction during training, simply add ```eval_metric=logloss``` to configure. Run again, we can find the log file becomes
xgboost also supports monitoring multiple metrics, suppose we also want to monitor average log-likelihood of each prediction during training, simply add ```eval_metric=logloss``` to configure. Run again, we can find the log file becomes
```
[0] test-error:0.016139 test-negllik:0.029795 trainname-error:0.014433 trainname-negllik:0.027023
[1] test-error:0.000000 test-negllik:0.000000 trainname-error:0.001228 trainname-negllik:0.002457
@@ -162,11 +162,15 @@ If you want to continue boosting from existing model, say 0002.model, use
```
xgboost will load from 0002.model continue boosting for 2 rounds, and save output to continue.model. However, beware that the training and evaluation data specified in mushroom.conf should not change when you use this function.
#### Use Multi-Threading
When you are working with a large dataset, you may want to take advantage of parallelism. If your compiler supports OpenMP, xgboost is naturally multi-threaded, to set number of parallel running threads to 10, add ```nthread=10``` to your configuration.
When you are working with a large dataset, you may want to take advantage of parallelism. If your compiler supports OpenMP, xgboost is naturally multi-threaded, to set number of parallel running add ```nthread``` parameter to you configuration.
Eg. ```nthread=10```
Set nthread to the number of physical CPU cores (on Unix this can be found using ```lscpu```).
Some systems report ```Thread(s) per core = 2```, for example a 4-core CPU with 8 threads; in such a case set ```nthread=4```, not 8.
#### Additional Notes
* What are ```agaricus.txt.test.buffer``` and ```agaricus.txt.train.buffer``` generated during runexp.sh?
- By default xgboost will automatically generate a binary format buffer of input data, with suffix ```buffer```. When next time you run xgboost, it detects i
Demonstrating how to use XGBoost accomplish binary classification tasks on UCI mushroom dataset http://archive.ics.uci.edu/ml/datasets/Mushroom
- By default xgboost will automatically generate a binary format buffer of input data, with suffix ```buffer```. Next time when you run xgboost, it will detects these binary files.

View File

@@ -1,5 +1,5 @@
XGBoost Python Feature Walkthrough
====
==================================
* [Basic walkthrough of wrappers](basic_walkthrough.py)
* [Cutomize loss function, and evaluation metric](custom_objective.py)
* [Boosting from existing prediction](boost_from_prediction.py)
@@ -7,5 +7,8 @@ XGBoost Python Feature Walkthrough
* [Generalized Linear Model](generalized_linear_model.py)
* [Cross validation](cross_validation.py)
* [Predicting leaf indices](predict_leaf_indices.py)
* [Sklearn Wrapper](sklearn_example.py)
* [Sklearn Wrapper](sklearn_examples.py)
* [Sklearn Parallel](sklearn_parallel.py)
* [Sklearn access evals result](sklearn_evals_result.py)
* [Access evals result](evals_result.py)
* [External Memory](external_memory.py)

View File

@@ -1,6 +1,7 @@
#!/usr/bin/python
import numpy as np
import scipy.sparse
import pickle
import xgboost as xgb
### simple example
@@ -19,7 +20,7 @@ bst = xgb.train(param, dtrain, num_round, watchlist)
# this is prediction
preds = bst.predict(dtest)
labels = dtest.get_label()
print ('error=%f' % ( sum(1 for i in range(len(preds)) if int(preds[i]>0.5)!=labels[i]) /float(len(preds))))
print ('error=%f' % ( sum(1 for i in range(len(preds)) if int(preds[i]>0.5)!=labels[i]) /float(len(preds))))
bst.save_model('0001.model')
# dump model
bst.dump_model('dump.raw.txt')
@@ -28,6 +29,7 @@ bst.dump_model('dump.nice.txt','../data/featmap.txt')
# save dmatrix into binary buffer
dtest.save_binary('dtest.buffer')
# save model
bst.save_model('xgb.model')
# load model and data in
bst2 = xgb.Booster(model_file='xgb.model')
@@ -36,6 +38,14 @@ preds2 = bst2.predict(dtest2)
# assert they are the same
assert np.sum(np.abs(preds2-preds)) == 0
# alternatively, you can pickle the booster
pks = pickle.dumps(bst2)
# load model and data in
bst3 = pickle.loads(pks)
preds3 = bst2.predict(dtest2)
# assert they are the same
assert np.sum(np.abs(preds3-preds)) == 0
###
# build dmatrix from scipy.sparse
print ('start running example of build DMatrix from scipy.sparse CSR Matrix')
@@ -44,22 +54,22 @@ row = []; col = []; dat = []
i = 0
for l in open('../data/agaricus.txt.train'):
arr = l.split()
labels.append( int(arr[0]))
labels.append(int(arr[0]))
for it in arr[1:]:
k,v = it.split(':')
row.append(i); col.append(int(k)); dat.append(float(v))
i += 1
csr = scipy.sparse.csr_matrix( (dat, (row,col)) )
dtrain = xgb.DMatrix( csr, label = labels )
csr = scipy.sparse.csr_matrix((dat, (row,col)))
dtrain = xgb.DMatrix(csr, label = labels)
watchlist = [(dtest,'eval'), (dtrain,'train')]
bst = xgb.train( param, dtrain, num_round, watchlist )
bst = xgb.train(param, dtrain, num_round, watchlist)
print ('start running example of build DMatrix from scipy.sparse CSC Matrix')
# we can also construct from csc matrix
csc = scipy.sparse.csc_matrix( (dat, (row,col)) )
csc = scipy.sparse.csc_matrix((dat, (row,col)))
dtrain = xgb.DMatrix(csc, label=labels)
watchlist = [(dtest,'eval'), (dtrain,'train')]
bst = xgb.train( param, dtrain, num_round, watchlist )
bst = xgb.train(param, dtrain, num_round, watchlist)
print ('start running example of build DMatrix from numpy array')
# NOTE: npymat is numpy array, we will convert it into scipy.sparse.csr_matrix in internal implementation
@@ -67,6 +77,6 @@ print ('start running example of build DMatrix from numpy array')
npymat = csr.todense()
dtrain = xgb.DMatrix(npymat, label = labels)
watchlist = [(dtest,'eval'), (dtrain,'train')]
bst = xgb.train( param, dtrain, num_round, watchlist )
bst = xgb.train(param, dtrain, num_round, watchlist)

View File

@@ -0,0 +1,30 @@
##
# This script demonstrate how to access the eval metrics in xgboost
##
import xgboost as xgb
dtrain = xgb.DMatrix('../data/agaricus.txt.train', silent=True)
dtest = xgb.DMatrix('../data/agaricus.txt.test', silent=True)
param = [('max_depth', 2), ('objective', 'binary:logistic'), ('eval_metric', 'logloss'), ('eval_metric', 'error')]
num_round = 2
watchlist = [(dtest,'eval'), (dtrain,'train')]
evals_result = {}
bst = xgb.train(param, dtrain, num_round, watchlist, evals_result=evals_result)
print('Access logloss metric directly from evals_result:')
print(evals_result['eval']['logloss'])
print('')
print('Access metrics through a loop:')
for e_name, e_mtrs in evals_result.items():
    print('- {}'.format(e_name))
    for e_mtr_name, e_mtr_vals in e_mtrs.items():
        print(' - {}'.format(e_mtr_name))
        print(' - {}'.format(e_mtr_vals))
print('')
print('Access complete dictionary:')
print(evals_result)


@@ -2,7 +2,11 @@
python basic_walkthrough.py
python custom_objective.py
python boost_from_prediction.py
python predict_first_ntree.py
python generalized_linear_model.py
python cross_validation.py
python predict_leaf_indices.py
python sklearn_examples.py
python sklearn_parallel.py
python external_memory.py
rm -rf *~ *.model *.buffer


@@ -0,0 +1,43 @@
##
# This script demonstrates how to access the xgboost eval metrics by using sklearn
##
import xgboost as xgb
import numpy as np
from sklearn.datasets import make_hastie_10_2
X, y = make_hastie_10_2(n_samples=2000, random_state=42)
# Map labels from {-1, 1} to {0, 1}
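# (np.unique with return_inverse=True returns the sorted unique labels and, for each sample, its index into them, giving 0/1 targets)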
labels, y = np.unique(y, return_inverse=True)
X_train, X_test = X[:1600], X[1600:]
y_train, y_test = y[:1600], y[1600:]
param_dist = {'objective':'binary:logistic', 'n_estimators':2}
clf = xgb.XGBModel(**param_dist)
# Or you can use: clf = xgb.XGBClassifier(**param_dist)
clf.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        eval_metric='logloss',
        verbose=True)
# Load evals result by calling the evals_result() function
evals_result = clf.evals_result()
print('Access logloss metric directly from validation_0:')
print(evals_result['validation_0']['logloss'])
print('')
print('Access metrics through a loop:')
for e_name, e_mtrs in evals_result.items():
    print('- {}'.format(e_name))
    for e_mtr_name, e_mtr_vals in e_mtrs.items():
        print(' - {}'.format(e_mtr_name))
        print(' - {}'.format(e_mtr_vals))
print('')
print('Access complete dict:')
print(evals_result)


@@ -8,7 +8,7 @@ import pickle
import xgboost as xgb
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.cross_validation import KFold, train_test_split
from sklearn.metrics import confusion_matrix, mean_squared_error
from sklearn.grid_search import GridSearchCV
from sklearn.datasets import load_iris, load_digits, load_boston
@@ -65,3 +65,13 @@ print("Pickling sklearn API models")
pickle.dump(clf, open("best_boston.pkl", "wb"))
clf2 = pickle.load(open("best_boston.pkl", "rb"))
print(np.allclose(clf.predict(X), clf2.predict(X)))
# Early-stopping
X = digits['data']
y = digits['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = xgb.XGBClassifier()
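# with early_stopping_rounds=10, fitting stops once the AUC on the eval_set has not improved for 10 consecutive rounds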
clf.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="auc",
        eval_set=[(X_test, y_test)])


@@ -45,7 +45,7 @@ dim(train)
train[1:6,1:5, with =F]
# Test dataset dimensions
dim(train)
dim(test)
# Test content
test[1:6,1:5, with =F]


@@ -6,7 +6,7 @@ Using XGBoost for regression is very similar to using it for binary classificati
The dataset we use is the [computer hardware dataset from the UCI repository](https://archive.ics.uci.edu/ml/datasets/Computer+Hardware). The regression demo is almost the same as the [binary classification demo](../binary_classification), except for a small difference in the general parameters:
```
# General parameter
# this is the only difference with classification, use reg:linear to do linear classification
# this is the only difference with classification, use reg:linear to do linear regression
# when labels are in [0,1] we can also use reg:logistic
objective = reg:linear
...
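For readers who prefer the Python API over the CLI configuration shown above, a minimal sketch of the same objective switch is given below; the file names, eta, max_depth, and round count are illustrative assumptions rather than values taken from this demo.

```python
import xgboost as xgb

# Hypothetical sketch only: paths and settings below are assumptions, not part of the demo.
dtrain = xgb.DMatrix('machine.txt.train')
dtest = xgb.DMatrix('machine.txt.test')

param = {
    'objective': 'reg:linear',  # linear regression; use 'reg:logistic' when labels lie in [0,1]
    'eta': 1.0,                 # assumed learning rate
    'max_depth': 2,             # assumed tree depth
}
num_round = 2
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
bst = xgb.train(param, dtrain, num_round, watchlist)
preds = bst.predict(dtest)
```

As in the configuration file, switching between reg:linear and reg:logistic is the only change needed relative to the classification setup.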

doc/.gitignore

@@ -0,0 +1,7 @@
html
latex
*.sh
_*
doxygen
parser.py
*.pyc

doc/Makefile

@@ -0,0 +1,192 @@
# Makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
PAPER =
BUILDDIR = _build
# User-friendly check for sphinx-build
ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)
$(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/)
endif
# Internal variables.
PAPEROPT_a4 = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
# the i18n builder cannot share the environment and doctrees with the others
I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest coverage gettext
help:
	@echo "Please use \`make <target>' where <target> is one of"
	@echo " html to make standalone HTML files"
	@echo " dirhtml to make HTML files named index.html in directories"
	@echo " singlehtml to make a single large HTML file"
	@echo " pickle to make pickle files"
	@echo " json to make JSON files"
	@echo " htmlhelp to make HTML files and a HTML help project"
	@echo " qthelp to make HTML files and a qthelp project"
	@echo " applehelp to make an Apple Help Book"
	@echo " devhelp to make HTML files and a Devhelp project"
	@echo " epub to make an epub"
	@echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
	@echo " latexpdf to make LaTeX files and run them through pdflatex"
	@echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx"
	@echo " text to make text files"
	@echo " man to make manual pages"
	@echo " texinfo to make Texinfo files"
	@echo " info to make Texinfo files and run them through makeinfo"
	@echo " gettext to make PO message catalogs"
	@echo " changes to make an overview of all changed/added/deprecated items"
	@echo " xml to make Docutils-native XML files"
	@echo " pseudoxml to make pseudoxml-XML files for display purposes"
	@echo " linkcheck to check all external links for integrity"
	@echo " doctest to run all doctests embedded in the documentation (if enabled)"
	@echo " coverage to run coverage check of the documentation (if enabled)"
clean:
	rm -rf $(BUILDDIR)/*
html:
	$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
	@echo
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
dirhtml:
	$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
	@echo
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."
singlehtml:
	$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
	@echo
	@echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."
pickle:
	$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
	@echo
	@echo "Build finished; now you can process the pickle files."
json:
	$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
	@echo
	@echo "Build finished; now you can process the JSON files."
htmlhelp:
	$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
	@echo
	@echo "Build finished; now you can run HTML Help Workshop with the" \
	      ".hhp project file in $(BUILDDIR)/htmlhelp."
qthelp:
	$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
	@echo
	@echo "Build finished; now you can run "qcollectiongenerator" with the" \
	      ".qhcp project file in $(BUILDDIR)/qthelp, like this:"
	@echo "# qcollectiongenerator $(BUILDDIR)/qthelp/rabit.qhcp"
	@echo "To view the help file:"
	@echo "# assistant -collectionFile $(BUILDDIR)/qthelp/rabit.qhc"
applehelp:
	$(SPHINXBUILD) -b applehelp $(ALLSPHINXOPTS) $(BUILDDIR)/applehelp
	@echo
	@echo "Build finished. The help book is in $(BUILDDIR)/applehelp."
	@echo "N.B. You won't be able to view it unless you put it in" \
	      "~/Library/Documentation/Help or install it in your application" \
	      "bundle."
devhelp:
	$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
	@echo
	@echo "Build finished."
	@echo "To view the help file:"
	@echo "# mkdir -p $$HOME/.local/share/devhelp/rabit"
	@echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/rabit"
	@echo "# devhelp"
epub:
	$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub
	@echo
	@echo "Build finished. The epub file is in $(BUILDDIR)/epub."
latex:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo
	@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
	@echo "Run \`make' in that directory to run these through (pdf)latex" \
	      "(use \`make latexpdf' here to do that automatically)."
latexpdf:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo "Running LaTeX files through pdflatex..."
	$(MAKE) -C $(BUILDDIR)/latex all-pdf
	@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
latexpdfja:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo "Running LaTeX files through platex and dvipdfmx..."
	$(MAKE) -C $(BUILDDIR)/latex all-pdf-ja
	@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
text:
	$(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text
	@echo
	@echo "Build finished. The text files are in $(BUILDDIR)/text."
man:
	$(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man
	@echo
	@echo "Build finished. The manual pages are in $(BUILDDIR)/man."
texinfo:
	$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
	@echo
	@echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo."
	@echo "Run \`make' in that directory to run these through makeinfo" \
	      "(use \`make info' here to do that automatically)."
info:
	$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
	@echo "Running Texinfo files through makeinfo..."
	make -C $(BUILDDIR)/texinfo info
	@echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo."
gettext:
	$(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale
	@echo
	@echo "Build finished. The message catalogs are in $(BUILDDIR)/locale."
changes:
	$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
	@echo
	@echo "The overview file is in $(BUILDDIR)/changes."
linkcheck:
	$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
	@echo
	@echo "Link check complete; look for any errors in the above output " \
	      "or in $(BUILDDIR)/linkcheck/output.txt."
doctest:
	$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
	@echo "Testing of doctests in the sources finished, look at the " \
	      "results in $(BUILDDIR)/doctest/output.txt."
coverage:
	$(SPHINXBUILD) -b coverage $(ALLSPHINXOPTS) $(BUILDDIR)/coverage
	@echo "Testing of coverage in the sources finished, look at the " \
	      "results in $(BUILDDIR)/coverage/python.txt."
xml:
	$(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml
	@echo
	@echo "Build finished. The XML files are in $(BUILDDIR)/xml."
pseudoxml:
	$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml
	@echo
	@echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml."

doc/README

@@ -0,0 +1,7 @@
The documentation of xgboost is generated with recommonmark and Sphinx.
You can build it locally by typing "make html" in this folder.
- clone https://github.com/tqchen/recommonmark to the repository root
- type make html
Check out https://recommonmark.readthedocs.org for a guide on how to write markdown with the extensions used in this doc, such as math formulas and tables of contents.

Some files were not shown because too many files have changed in this diff Show More