Compare commits

..

20 Commits

Author SHA1 Message Date
Hyunsu Cho
eb69c6110a Bump version to 1.5.1 2021-11-22 14:29:59 -08:00
Jiaming Yuan
0f9ffcdc16 [backport] Fix R CRAN failures. (#7404) (#7451)
* Remove hist builder dtor.

* Initialize values.

* Tolerance.

* Remove the use of nthread in col maker.
2021-11-19 21:40:04 +08:00
Jiaming Yuan
9bbd00a49f [backport] Set use_logger in tracker to false. (#7438) (#7439) 2021-11-16 09:51:37 +08:00
Jiaming Yuan
7e239f229c [CI] Install igraph as binary. (#7417) (#7430) 2021-11-13 01:53:41 +08:00
Jiaming Yuan
a013942649 Check number of trees in inplace predict. (#7409) (#7424) 2021-11-12 19:31:31 +08:00
Jiaming Yuan
4d2ea0d4ef [backport] [doc] Fix broken links. (#7341) (#7418)
* Fix most of the link checks from sphinx.
* Remove duplicate explicit target name.
2021-11-11 19:33:02 +08:00
Jiaming Yuan
d1052b5cfe [jvm-packages] Fix json4s binary compatibility issue (#7376) (#7414)
Spark 3.2 depends on 3.7.0-M11 which has changed some implicited functions'
signatures. And it will result the xgboost4j built against spark 3.0/3.1
failed when saving the model.

Co-authored-by: Bobby Wang <wbo4958@gmail.com>
2021-11-10 21:25:11 +08:00
Jiaming Yuan
14c56f05da [backport] Handle missing values in dataframe with category dtype. (#7331) (#7413)
* Handle missing values in dataframe with category dtype. (#7331)

* Replace -1 in pandas initializer.
* Unify `IsValid` functor.
* Mimic pandas data handling in cuDF glue code.
* Check invalid categories.
* Fix DDM sketching.

* Fix pick error.
2021-11-10 21:24:46 +08:00
Jiaming Yuan
11f8b5cfcd [backport] Support building with CTK11.5. (#7379) (#7411)
* Support building with CTK11.5.

* Require system cub installation for CTK11.4+.
* Check thrust version for segmented sort.
2021-11-10 19:23:29 +08:00
Jiaming Yuan
e7ac2486eb [backport] [R] Fix global feature importance and predict with 1 sample. (#7394) (#7397)
* [R] Fix global feature importance.

* Add implementation for tree index.  The parameter is not documented in C API since we
should work on porting the model slicing to R instead of supporting more use of tree
index.

* Fix the difference between "gain" and "total_gain".

* debug.

* Fix prediction.
2021-11-06 00:07:36 +08:00
Jiaming Yuan
a3d195e73e Handle OMP_THREAD_LIMIT. (#7390) (#7391) 2021-11-03 20:25:51 +08:00
Jiaming Yuan
fab3c05ced Move macos test to github action. (#7382) (#7392)
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2021-11-03 18:39:47 +08:00
Jiaming Yuan
584b45a9cc Release 1.5.0. (#7317) 2021-10-15 12:21:04 +08:00
Jiaming Yuan
30c1b5c54c [backport] Fix prediction with cat data in sklearn interface. (#7306) (#7312)
* Specify DMatrix parameter for pre-processing dataframe.
* Add document about the behaviour of prediction.
2021-10-12 18:49:57 +08:00
Jiaming Yuan
36e247aca4 Fix weighted samples in multi-class AUC. (#7300) (#7305) 2021-10-11 18:00:36 +08:00
Jiaming Yuan
c4aff733bb [backport] Fix cv verbose_eval (#7291) (#7296) 2021-10-08 14:24:27 +08:00
Jiaming Yuan
cdbfd21d31 [backport] Fix gamma neg log likelihood. (#7275) (#7285) 2021-10-05 23:01:11 +08:00
Jiaming Yuan
508a0b0dbd [backport] [R] Fix document for nthread. (#7263) (#7269) 2021-09-28 14:41:32 +08:00
Jiaming Yuan
e04e773f9f Add RC1 tag for building packages. (#7261) 2021-09-28 11:50:18 +08:00
Jiaming Yuan
1debabb321 Change version to 1.5.0. (#7258) 2021-09-26 13:27:54 +08:00
459 changed files with 10879 additions and 23177 deletions

View File

@@ -1,214 +0,0 @@
---
Language: Cpp
# BasedOnStyle: Google
AccessModifierOffset: -1
AlignAfterOpenBracket: Align
AlignArrayOfStructures: None
AlignConsecutiveMacros: None
AlignConsecutiveAssignments: None
AlignConsecutiveBitFields: None
AlignConsecutiveDeclarations: None
AlignEscapedNewlines: Left
AlignOperands: Align
AlignTrailingComments: true
AllowAllArgumentsOnNextLine: true
AllowAllParametersOfDeclarationOnNextLine: true
AllowShortEnumsOnASingleLine: true
AllowShortBlocksOnASingleLine: Never
AllowShortCaseLabelsOnASingleLine: false
AllowShortFunctionsOnASingleLine: All
AllowShortLambdasOnASingleLine: All
AllowShortIfStatementsOnASingleLine: WithoutElse
AllowShortLoopsOnASingleLine: true
AlwaysBreakAfterDefinitionReturnType: None
AlwaysBreakAfterReturnType: None
AlwaysBreakBeforeMultilineStrings: true
AlwaysBreakTemplateDeclarations: Yes
AttributeMacros:
- __capability
BinPackArguments: true
BinPackParameters: true
BraceWrapping:
AfterCaseLabel: false
AfterClass: false
AfterControlStatement: Never
AfterEnum: false
AfterFunction: false
AfterNamespace: false
AfterObjCDeclaration: false
AfterStruct: false
AfterUnion: false
AfterExternBlock: false
BeforeCatch: false
BeforeElse: false
BeforeLambdaBody: false
BeforeWhile: false
IndentBraces: false
SplitEmptyFunction: true
SplitEmptyRecord: true
SplitEmptyNamespace: true
BreakBeforeBinaryOperators: None
BreakBeforeConceptDeclarations: true
BreakBeforeBraces: Attach
BreakBeforeInheritanceComma: false
BreakInheritanceList: BeforeColon
BreakBeforeTernaryOperators: true
BreakConstructorInitializersBeforeComma: false
BreakConstructorInitializers: BeforeColon
BreakAfterJavaFieldAnnotations: false
BreakStringLiterals: true
ColumnLimit: 100
CommentPragmas: '^ IWYU pragma:'
QualifierAlignment: Leave
CompactNamespaces: false
ConstructorInitializerIndentWidth: 4
ContinuationIndentWidth: 4
Cpp11BracedListStyle: true
DeriveLineEnding: true
DerivePointerAlignment: true
DisableFormat: false
EmptyLineAfterAccessModifier: Never
EmptyLineBeforeAccessModifier: LogicalBlock
ExperimentalAutoDetectBinPacking: false
PackConstructorInitializers: NextLine
BasedOnStyle: ''
ConstructorInitializerAllOnOneLineOrOnePerLine: false
AllowAllConstructorInitializersOnNextLine: true
FixNamespaceComments: true
ForEachMacros:
- foreach
- Q_FOREACH
- BOOST_FOREACH
IfMacros:
- KJ_IF_MAYBE
IncludeBlocks: Regroup
IncludeCategories:
- Regex: '^<ext/.*\.h>'
Priority: 2
SortPriority: 0
CaseSensitive: false
- Regex: '^<.*\.h>'
Priority: 1
SortPriority: 0
CaseSensitive: false
- Regex: '^<.*'
Priority: 2
SortPriority: 0
CaseSensitive: false
- Regex: '.*'
Priority: 3
SortPriority: 0
CaseSensitive: false
IncludeIsMainRegex: '([-_](test|unittest))?$'
IncludeIsMainSourceRegex: ''
IndentAccessModifiers: false
IndentCaseLabels: true
IndentCaseBlocks: false
IndentGotoLabels: true
IndentPPDirectives: None
IndentExternBlock: AfterExternBlock
IndentRequires: false
IndentWidth: 2
IndentWrappedFunctionNames: false
InsertTrailingCommas: None
JavaScriptQuotes: Leave
JavaScriptWrapImports: true
KeepEmptyLinesAtTheStartOfBlocks: false
LambdaBodyIndentation: Signature
MacroBlockBegin: ''
MacroBlockEnd: ''
MaxEmptyLinesToKeep: 1
NamespaceIndentation: None
ObjCBinPackProtocolList: Never
ObjCBlockIndentWidth: 2
ObjCBreakBeforeNestedBlockParam: true
ObjCSpaceAfterProperty: false
ObjCSpaceBeforeProtocolList: true
PenaltyBreakAssignment: 2
PenaltyBreakBeforeFirstCallParameter: 1
PenaltyBreakComment: 300
PenaltyBreakFirstLessLess: 120
PenaltyBreakString: 1000
PenaltyBreakTemplateDeclaration: 10
PenaltyExcessCharacter: 1000000
PenaltyReturnTypeOnItsOwnLine: 200
PenaltyIndentedWhitespace: 0
PointerAlignment: Left
PPIndentWidth: -1
RawStringFormats:
- Language: Cpp
Delimiters:
- cc
- CC
- cpp
- Cpp
- CPP
- 'c++'
- 'C++'
CanonicalDelimiter: ''
BasedOnStyle: google
- Language: TextProto
Delimiters:
- pb
- PB
- proto
- PROTO
EnclosingFunctions:
- EqualsProto
- EquivToProto
- PARSE_PARTIAL_TEXT_PROTO
- PARSE_TEST_PROTO
- PARSE_TEXT_PROTO
- ParseTextOrDie
- ParseTextProtoOrDie
- ParseTestProto
- ParsePartialTestProto
CanonicalDelimiter: pb
BasedOnStyle: google
ReferenceAlignment: Pointer
ReflowComments: true
ShortNamespaceLines: 1
SortIncludes: CaseSensitive
SortJavaStaticImport: Before
SortUsingDeclarations: true
SpaceAfterCStyleCast: false
SpaceAfterLogicalNot: false
SpaceAfterTemplateKeyword: true
SpaceBeforeAssignmentOperators: true
SpaceBeforeCaseColon: false
SpaceBeforeCpp11BracedList: false
SpaceBeforeCtorInitializerColon: true
SpaceBeforeInheritanceColon: true
SpaceBeforeParens: ControlStatements
SpaceAroundPointerQualifiers: Default
SpaceBeforeRangeBasedForLoopColon: true
SpaceInEmptyBlock: false
SpaceInEmptyParentheses: false
SpacesBeforeTrailingComments: 2
SpacesInAngles: Never
SpacesInConditionalStatement: false
SpacesInContainerLiterals: true
SpacesInCStyleCastParentheses: false
SpacesInLineCommentPrefix:
Minimum: 1
Maximum: -1
SpacesInParentheses: false
SpacesInSquareBrackets: false
SpaceBeforeSquareBrackets: false
BitFieldColonSpacing: Both
Standard: Auto
StatementAttributeLikeMacros:
- Q_EMIT
StatementMacros:
- Q_UNUSED
- QT_REQUIRE_VERSION
TabWidth: 8
UseCRLF: false
UseTab: Never
WhitespaceSensitiveMacros:
- STRINGIZE
- PP_STRINGIZE
- BOOST_PP_STRINGIZE
- NS_SWIFT_NAME
- CF_SWIFT_NAME
...

View File

@@ -21,7 +21,10 @@ jobs:
submodules: 'true'
- name: Install system packages
run: |
# Use libomp 11.1.0: https://github.com/dmlc/xgboost/issues/7039
wget https://raw.githubusercontent.com/Homebrew/homebrew-core/679923b4eb48a8dc7ecc1f05d06063cd79b3fc00/Formula/libomp.rb -O $(find $(brew --repository) -name libomp.rb)
brew install ninja libomp
brew pin libomp
- name: Build gtest binary
run: |
mkdir build

View File

@@ -17,7 +17,10 @@ jobs:
- name: Install osx system dependencies
if: matrix.os == 'macos-10.15'
run: |
# Use libomp 11.1.0: https://github.com/dmlc/xgboost/issues/7039
wget https://raw.githubusercontent.com/Homebrew/homebrew-core/679923b4eb48a8dc7ecc1f05d06063cd79b3fc00/Formula/libomp.rb -O $(find $(brew --repository) -name libomp.rb)
brew install ninja libomp
brew pin libomp
- name: Install Ubuntu system dependencies
if: matrix.os == 'ubuntu-latest'
run: |
@@ -42,59 +45,13 @@ jobs:
cd ..
python -c 'import xgboost'
python-tests-on-win:
name: Test XGBoost Python package on ${{ matrix.config.os }}
runs-on: ${{ matrix.config.os }}
strategy:
matrix:
config:
- {os: windows-latest, python-version: '3.8'}
steps:
- uses: actions/checkout@v2
with:
submodules: 'true'
- uses: conda-incubator/setup-miniconda@v2
with:
auto-update-conda: true
python-version: ${{ matrix.config.python-version }}
activate-environment: win64_env
environment-file: tests/ci_build/conda_env/win64_cpu_test.yml
- name: Display Conda env
shell: bash -l {0}
run: |
conda info
conda list
- name: Build XGBoost on Windows
shell: bash -l {0}
run: |
mkdir build_msvc
cd build_msvc
cmake .. -G"Visual Studio 17 2022" -DCMAKE_CONFIGURATION_TYPES="Release" -A x64 -DGOOGLE_TEST=ON -DUSE_DMLC_GTEST=ON
cmake --build . --config Release --parallel $(nproc)
- name: Install Python package
shell: bash -l {0}
run: |
cd python-package
python --version
python setup.py bdist_wheel --universal
pip install ./dist/*.whl
- name: Test Python package
shell: bash -l {0}
run: |
pytest -s -v ./tests/python
python-tests-on-macos:
python-tests:
name: Test XGBoost Python package on ${{ matrix.config.os }}
runs-on: ${{ matrix.config.os }}
strategy:
matrix:
config:
- {os: windows-2016, python-version: '3.8'}
- {os: macos-10.15, python-version "3.8" }
steps:
@@ -106,8 +63,8 @@ jobs:
with:
auto-update-conda: true
python-version: ${{ matrix.config.python-version }}
activate-environment: macos_test
environment-file: tests/ci_build/conda_env/macos_cpu_test.yml
activate-environment: win64_test
environment-file: tests/ci_build/conda_env/win64_cpu_test.yml
- name: Display Conda env
shell: bash -l {0}
@@ -115,17 +72,25 @@ jobs:
conda info
conda list
- name: Build XGBoost on macos
- name: Build XGBoost on Windows
shell: bash -l {0}
if: matrix.config.os == 'windows-2016'
run: |
brew install ninja
mkdir build_msvc
cd build_msvc
cmake .. -G"Visual Studio 15 2017" -DCMAKE_CONFIGURATION_TYPES="Release" -A x64 -DGOOGLE_TEST=ON -DUSE_DMLC_GTEST=ON
cmake --build . --config Release --parallel $(nproc)
- name: Build XGBoost on macos
if: matrix.config.os == 'macos-10.15'
run: |
wget https://raw.githubusercontent.com/Homebrew/homebrew-core/679923b4eb48a8dc7ecc1f05d06063cd79b3fc00/Formula/libomp.rb -O $(find $(brew --repository) -name libomp.rb)
brew install ninja libomp
brew pin libomp
mkdir build
cd build
# Set prefix, to use OpenMP library from Conda env
# See https://github.com/dmlc/xgboost/issues/7039#issuecomment-1025038228
# to learn why we don't use libomp from Homebrew.
cmake .. -GNinja -DGOOGLE_TEST=ON -DUSE_DMLC_GTEST=ON -DCMAKE_PREFIX_PATH=$CONDA_PREFIX
cmake .. -GNinja -DGOOGLE_TEST=ON -DUSE_DMLC_GTEST=ON
ninja
- name: Install Python package
@@ -140,3 +105,21 @@ jobs:
shell: bash -l {0}
run: |
pytest -s -v ./tests/python
- name: Rename Python wheel
shell: bash -l {0}
if: matrix.config.os == 'macos-10.15'
run: |
TAG=macosx_10_15_x86_64.macosx_11_0_x86_64.macosx_12_0_x86_64
python tests/ci_build/rename_whl.py python-package/dist/*.whl ${{ github.sha }} ${TAG}
- name: Upload Python wheel
shell: bash -l {0}
if: |
(github.ref == 'refs/heads/master' || contains(github.ref, 'refs/heads/release_')) &&
matrix.os == 'macos-latest'
run: |
python -m awscli s3 cp python-package/dist/*.whl s3://xgboost-nightly-builds/${{ steps.extract_branch.outputs.branch }}/ --acl public-read
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_IAM_S3_UPLOADER }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_IAM_S3_UPLOADER }}

View File

@@ -1,38 +0,0 @@
name: XGBoost-Python-Wheels
on: [push, pull_request]
jobs:
python-wheels:
name: Build wheel for ${{ matrix.platform_id }}
runs-on: ${{ matrix.os }}
strategy:
matrix:
include:
- os: macos-latest
platform_id: macosx_x86_64
- os: macos-latest
platform_id: macosx_arm64
steps:
- uses: actions/checkout@v2
with:
submodules: 'true'
- name: Setup Python
uses: actions/setup-python@v2
with:
python-version: '3.9'
- name: Build wheels
run: bash tests/ci_build/build_python_wheels.sh ${{ matrix.platform_id }} ${{ github.sha }}
- name: Extract branch name
shell: bash
run: echo "##[set-output name=branch;]$(echo ${GITHUB_REF#refs/heads/})"
id: extract_branch
if: github.ref == 'refs/heads/master' || contains(github.ref, 'refs/heads/release_')
- name: Upload Python wheel
if: github.ref == 'refs/heads/master' || contains(github.ref, 'refs/heads/release_')
run: |
python -m pip install awscli
python -m awscli s3 cp wheelhouse/*.whl s3://xgboost-nightly-builds/${{ steps.extract_branch.outputs.branch }}/ --acl public-read
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_IAM_S3_UPLOADER }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_IAM_S3_UPLOADER }}

View File

@@ -31,8 +31,8 @@ jobs:
uses: actions/cache@v2
with:
path: ${{ env.R_LIBS_USER }}
key: ${{ runner.os }}-r-${{ matrix.config.r }}-3-${{ hashFiles('R-package/DESCRIPTION') }}
restore-keys: ${{ runner.os }}-r-${{ matrix.config.r }}-3-${{ hashFiles('R-package/DESCRIPTION') }}
key: ${{ runner.os }}-r-${{ matrix.config.r }}-2-${{ hashFiles('R-package/DESCRIPTION') }}
restore-keys: ${{ runner.os }}-r-${{ matrix.config.r }}-2-${{ hashFiles('R-package/DESCRIPTION') }}
- name: Install dependencies
shell: Rscript {0}
@@ -59,9 +59,9 @@ jobs:
fail-fast: false
matrix:
config:
- {os: windows-latest, r: 'release', compiler: 'mingw', build: 'autotools'}
- {os: windows-latest, r: 'release', compiler: 'msvc', build: 'cmake'}
- {os: windows-latest, r: 'release', compiler: 'mingw', build: 'cmake'}
- {os: windows-2016, r: 'release', compiler: 'mingw', build: 'autotools'}
- {os: windows-2016, r: 'release', compiler: 'msvc', build: 'cmake'}
- {os: windows-2016, r: 'release', compiler: 'mingw', build: 'cmake'}
env:
R_REMOTES_NO_ERRORS_FROM_WARNINGS: true
RSPM: ${{ matrix.config.rspm }}
@@ -79,8 +79,8 @@ jobs:
uses: actions/cache@v2
with:
path: ${{ env.R_LIBS_USER }}
key: ${{ runner.os }}-r-${{ matrix.config.r }}-3-${{ hashFiles('R-package/DESCRIPTION') }}
restore-keys: ${{ runner.os }}-r-${{ matrix.config.r }}-3-${{ hashFiles('R-package/DESCRIPTION') }}
key: ${{ runner.os }}-r-${{ matrix.config.r }}-2-${{ hashFiles('R-package/DESCRIPTION') }}
restore-keys: ${{ runner.os }}-r-${{ matrix.config.r }}-2-${{ hashFiles('R-package/DESCRIPTION') }}
- name: Install dependencies
shell: Rscript {0}
@@ -90,7 +90,7 @@ jobs:
dependencies = c('Depends', 'Imports', 'LinkingTo'))
- name: Install igraph on Windows
shell: Rscript {0}
if: matrix.config.os == 'windows-latest'
if: matrix.config.os == 'windows-2016'
run: |
install.packages('igraph', type='binary', dependencies = c('Depends', 'Imports', 'LinkingTo'))
@@ -131,8 +131,8 @@ jobs:
uses: actions/cache@v2
with:
path: ${{ env.R_LIBS_USER }}
key: ${{ runner.os }}-r-${{ matrix.config.r }}-3-${{ hashFiles('R-package/DESCRIPTION') }}
restore-keys: ${{ runner.os }}-r-${{ matrix.config.r }}-3-${{ hashFiles('R-package/DESCRIPTION') }}
key: ${{ runner.os }}-r-${{ matrix.config.r }}-2-${{ hashFiles('R-package/DESCRIPTION') }}
restore-keys: ${{ runner.os }}-r-${{ matrix.config.r }}-2-${{ hashFiles('R-package/DESCRIPTION') }}
- name: Install dependencies
shell: Rscript {0}

6
.gitignore vendored
View File

@@ -63,7 +63,6 @@ nb-configuration*
# Eclipse
.project
.cproject
.classpath
.pydevproject
.settings/
build
@@ -126,8 +125,3 @@ credentials.csv
*.pub
*.rdp
*_rsa
# Visual Studio code + extensions
.vscode
.metals
.bloop

View File

@@ -1,5 +1,5 @@
cmake_minimum_required(VERSION 3.14 FATAL_ERROR)
project(xgboost LANGUAGES CXX C VERSION 1.6.0)
project(xgboost LANGUAGES CXX C VERSION 1.5.1)
include(cmake/Utils.cmake)
list(APPEND CMAKE_MODULE_PATH "${xgboost_SOURCE_DIR}/cmake/modules")
cmake_policy(SET CMP0022 NEW)
@@ -28,7 +28,6 @@ set_default_configuration_release()
option(BUILD_C_DOC "Build documentation for C APIs using Doxygen." OFF)
option(USE_OPENMP "Build with OpenMP support." ON)
option(BUILD_STATIC_LIB "Build static library" OFF)
option(FORCE_SHARED_CRT "Build with dynamic CRT on Windows (/MD)" OFF)
option(RABIT_BUILD_MPI "Build MPI" OFF)
## Bindings
option(JVM_BINDINGS "Build JVM bindings" OFF)
@@ -138,7 +137,7 @@ if (USE_CUDA)
add_subdirectory(${PROJECT_SOURCE_DIR}/gputreeshap)
if ((${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 11.4) AND (NOT BUILD_WITH_CUDA_CUB))
set(BUILD_WITH_CUDA_CUB ON)
message(SEND_ERROR "`BUILD_WITH_CUDA_CUB` should be set to `ON` for CUDA >= 11.4")
endif ()
endif (USE_CUDA)
@@ -165,9 +164,6 @@ endif (USE_NCCL)
# dmlc-core
msvc_use_static_runtime()
if (FORCE_SHARED_CRT)
set(DMLC_FORCE_SHARED_CRT ON)
endif ()
add_subdirectory(${xgboost_SOURCE_DIR}/dmlc-core)
if (MSVC)
@@ -304,7 +300,7 @@ write_basic_package_version_file(
COMPATIBILITY AnyNewerVersion)
install(
FILES
${CMAKE_CURRENT_BINARY_DIR}/cmake/xgboost-config.cmake
${CMAKE_BINARY_DIR}/cmake/xgboost-config.cmake
${CMAKE_BINARY_DIR}/cmake/xgboost-config-version.cmake
DESTINATION ${CMAKE_INSTALL_LIBDIR}/cmake/xgboost)

View File

@@ -10,8 +10,8 @@ The Project Management Committee(PMC) consists group of active committers that m
- Tianqi is a Ph.D. student working on large-scale machine learning. He is the creator of the project.
* [Michael Benesty](https://github.com/pommedeterresautee)
- Michael is a lawyer and data scientist in France. He is the creator of XGBoost interactive analysis module in R.
* [Yuan Tang](https://github.com/terrytangyuan), Akuity
- Yuan is a founding engineer at Akuity. He contributed mostly in R and Python packages.
* [Yuan Tang](https://github.com/terrytangyuan), Ant Group
- Yuan is a software engineer in Ant Group. He contributed mostly in R and Python packages.
* [Nan Zhu](https://github.com/CodingCat), Uber
- Nan is a software engineer in Uber. He contributed mostly in JVM packages.
* [Jiaming Yuan](https://github.com/trivialfis)

46
Jenkinsfile vendored
View File

@@ -7,7 +7,7 @@
dockerRun = 'tests/ci_build/ci_build.sh'
// Which CUDA version to use when building reference distribution wheel
ref_cuda_ver = '11.0'
ref_cuda_ver = '10.1'
import groovy.transform.Field
@@ -58,12 +58,14 @@ pipeline {
'build-cpu': { BuildCPU() },
'build-cpu-arm64': { BuildCPUARM64() },
'build-cpu-rabit-mock': { BuildCPUMock() },
// Build reference, distribution-ready Python wheel with CUDA 11.0
// Build reference, distribution-ready Python wheel with CUDA 10.1
// using CentOS 7 image
'build-gpu-cuda10.1': { BuildCUDA(cuda_version: '10.1') },
// The build-gpu-* builds below use Ubuntu image
'build-gpu-cuda11.0': { BuildCUDA(cuda_version: '11.0', build_rmm: true) },
'build-gpu-rpkg': { BuildRPackageWithCUDA(cuda_version: '11.0') },
'build-jvm-packages-gpu-cuda11.0': { BuildJVMPackagesWithCUDA(spark_version: '3.0.1', cuda_version: '11.0') },
'build-jvm-packages': { BuildJVMPackages(spark_version: '3.0.1') },
'build-gpu-rpkg': { BuildRPackageWithCUDA(cuda_version: '10.1') },
'build-jvm-packages-gpu-cuda10.1': { BuildJVMPackagesWithCUDA(spark_version: '3.0.0', cuda_version: '11.0') },
'build-jvm-packages': { BuildJVMPackages(spark_version: '3.0.0') },
'build-jvm-doc': { BuildJVMDoc() }
])
}
@@ -77,10 +79,13 @@ pipeline {
'test-python-cpu': { TestPythonCPU() },
'test-python-cpu-arm64': { TestPythonCPUARM64() },
// artifact_cuda_version doesn't apply to RMM tests; RMM tests will always match CUDA version between artifact and host env
'test-python-gpu-cuda11.0': { TestPythonGPU(artifact_cuda_version: '11.0', host_cuda_version: '11.0', test_rmm: true) },
'test-python-mgpu-cuda11.0': { TestPythonGPU(artifact_cuda_version: '11.0', host_cuda_version: '11.0', multi_gpu: true, test_rmm: true) },
'test-python-gpu-cuda11.0-cross': { TestPythonGPU(artifact_cuda_version: '10.1', host_cuda_version: '11.0', test_rmm: true) },
'test-python-gpu-cuda11.0': { TestPythonGPU(artifact_cuda_version: '11.0', host_cuda_version: '11.0') },
'test-python-mgpu-cuda11.0': { TestPythonGPU(artifact_cuda_version: '10.1', host_cuda_version: '11.0', multi_gpu: true, test_rmm: true) },
'test-cpp-gpu-cuda11.0': { TestCppGPU(artifact_cuda_version: '11.0', host_cuda_version: '11.0', test_rmm: true) },
'test-jvm-jdk8': { CrossTestJVMwithJDK(jdk_version: '8', spark_version: '3.0.0') }
'test-jvm-jdk8': { CrossTestJVMwithJDK(jdk_version: '8', spark_version: '3.0.0') },
'test-jvm-jdk11': { CrossTestJVMwithJDK(jdk_version: '11') },
'test-jvm-jdk12': { CrossTestJVMwithJDK(jdk_version: '12') }
])
}
}
@@ -123,9 +128,9 @@ def ClangTidy() {
echo "Running clang-tidy job..."
def container_type = "clang_tidy"
def docker_binary = "docker"
def dockerArgs = "--build-arg CUDA_VERSION_ARG=11.0"
def dockerArgs = "--build-arg CUDA_VERSION_ARG=10.1"
sh """
${dockerRun} ${container_type} ${docker_binary} ${dockerArgs} python3 tests/ci_build/tidy.py --cuda-archs 75
${dockerRun} ${container_type} ${docker_binary} ${dockerArgs} python3 tests/ci_build/tidy.py
"""
deleteDir()
}
@@ -179,9 +184,8 @@ def BuildCPUARM64() {
stash name: "xgboost_whl_arm64_cpu", includes: 'python-package/dist/*.whl'
if (env.BRANCH_NAME == 'master' || env.BRANCH_NAME.startsWith('release')) {
echo 'Uploading Python wheel...'
sh """
${dockerRun} ${container_type} ${docker_binary} bash -c "source activate aarch64_test && python -m awscli s3 cp python-package/dist/*.whl s3://xgboost-nightly-builds/${BRANCH_NAME}/ --acl public-read --no-progress"
"""
path = ("${BRANCH_NAME}" == 'master') ? '' : "${BRANCH_NAME}/"
s3Upload bucket: 'xgboost-nightly-builds', path: path, acl: 'PublicRead', workingDir: 'python-package/dist', includePathPattern:'**/*.whl'
}
stash name: 'xgboost_cli_arm64', includes: 'xgboost'
deleteDir()
@@ -233,9 +237,8 @@ def BuildCUDA(args) {
stash name: "xgboost_whl_cuda${args.cuda_version}", includes: 'python-package/dist/*.whl'
if (args.cuda_version == ref_cuda_ver && (env.BRANCH_NAME == 'master' || env.BRANCH_NAME.startsWith('release'))) {
echo 'Uploading Python wheel...'
sh """
${dockerRun} ${container_type} ${docker_binary} ${docker_args} python -m awscli s3 cp python-package/dist/*.whl s3://xgboost-nightly-builds/${BRANCH_NAME}/ --acl public-read --no-progress
"""
path = ("${BRANCH_NAME}" == 'master') ? '' : "${BRANCH_NAME}/"
s3Upload bucket: 'xgboost-nightly-builds', path: path, acl: 'PublicRead', workingDir: 'python-package/dist', includePathPattern:'**/*.whl'
}
echo 'Stashing C++ test executable (testxgboost)...'
stash name: "xgboost_cpp_tests_cuda${args.cuda_version}", includes: 'build/testxgboost'
@@ -270,9 +273,8 @@ def BuildRPackageWithCUDA(args) {
${dockerRun} ${container_type} ${docker_binary} ${docker_args} tests/ci_build/build_r_pkg_with_cuda.sh ${commit_id}
"""
echo 'Uploading R tarball...'
sh """
${dockerRun} ${container_type} ${docker_binary} ${docker_args} python -m awscli s3 cp xgboost_r_gpu_linux_*.tar.gz s3://xgboost-nightly-builds/${BRANCH_NAME}/ --acl public-read --no-progress
"""
path = ("${BRANCH_NAME}" == 'master') ? '' : "${BRANCH_NAME}/"
s3Upload bucket: 'xgboost-nightly-builds', path: path, acl: 'PublicRead', includePathPattern:'xgboost_r_gpu_linux_*.tar.gz'
}
deleteDir()
}
@@ -328,9 +330,7 @@ def BuildJVMDoc() {
"""
if (env.BRANCH_NAME == 'master' || env.BRANCH_NAME.startsWith('release')) {
echo 'Uploading doc...'
sh """
${dockerRun} ${container_type} ${docker_binary} python -m awscli s3 cp jvm-packages/${BRANCH_NAME}.tar.bz2 s3://xgboost-docs/${BRANCH_NAME}.tar.bz2 --acl public-read --no-progress
"""
s3Upload file: "jvm-packages/${BRANCH_NAME}.tar.bz2", bucket: 'xgboost-docs', acl: 'PublicRead', path: "${BRANCH_NAME}.tar.bz2"
}
deleteDir()
}
@@ -445,7 +445,7 @@ def DeployJVMPackages(args) {
if (env.BRANCH_NAME == 'master' || env.BRANCH_NAME.startsWith('release')) {
echo 'Deploying to xgboost-maven-repo S3 repo...'
sh """
${dockerRun} jvm_gpu_build docker --build-arg CUDA_VERSION_ARG=11.0 tests/ci_build/deploy_jvm_packages.sh ${args.spark_version}
${dockerRun} jvm_gpu_build docker --build-arg CUDA_VERSION_ARG=10.1 tests/ci_build/deploy_jvm_packages.sh ${args.spark_version}
"""
}
deleteDir()

View File

@@ -40,8 +40,8 @@ pipeline {
steps {
script {
parallel ([
'build-win64-cuda11.0': { BuildWin64() },
'build-rpkg-win64-cuda11.0': { BuildRPackageWithCUDAWin64() }
'build-win64-cuda10.1': { BuildWin64() },
'build-rpkg-win64-cuda10.1': { BuildRPackageWithCUDAWin64() }
])
}
}
@@ -51,7 +51,7 @@ pipeline {
steps {
script {
parallel ([
'test-win64-cuda11.0': { TestWin64() },
'test-win64-cuda10.1': { TestWin64() },
])
}
}
@@ -75,7 +75,7 @@ def checkoutSrcs() {
}
def BuildWin64() {
node('win64 && cuda11_unified') {
node('win64 && cuda10_unified') {
deleteDir()
unstash name: 'srcs'
echo "Building XGBoost for Windows AMD64 target..."
@@ -107,7 +107,7 @@ def BuildWin64() {
stash name: 'xgboost_whl', includes: 'python-package/dist/*.whl'
if (env.BRANCH_NAME == 'master' || env.BRANCH_NAME.startsWith('release')) {
echo 'Uploading Python wheel...'
path = "${BRANCH_NAME}/"
path = ("${BRANCH_NAME}" == 'master') ? '' : "${BRANCH_NAME}/"
s3Upload bucket: 'xgboost-nightly-builds', path: path, acl: 'PublicRead', workingDir: 'python-package/dist', includePathPattern:'**/*.whl'
}
echo 'Stashing C++ test executable (testxgboost)...'
@@ -118,7 +118,7 @@ def BuildWin64() {
}
def BuildRPackageWithCUDAWin64() {
node('win64 && cuda11_unified') {
node('win64 && cuda10_unified') {
deleteDir()
unstash name: 'srcs'
bat "nvcc --version"
@@ -127,7 +127,7 @@ def BuildRPackageWithCUDAWin64() {
bash tests/ci_build/build_r_pkg_with_cuda_win64.sh ${commit_id}
"""
echo 'Uploading R tarball...'
path = "${BRANCH_NAME}/"
path = ("${BRANCH_NAME}" == 'master') ? '' : "${BRANCH_NAME}/"
s3Upload bucket: 'xgboost-nightly-builds', path: path, acl: 'PublicRead', includePathPattern:'xgboost_r_gpu_win64_*.tar.gz'
}
deleteDir()
@@ -135,7 +135,7 @@ def BuildRPackageWithCUDAWin64() {
}
def TestWin64() {
node('win64 && cuda11_unified') {
node('win64 && cuda10_unified') {
deleteDir()
unstash name: 'srcs'
unstash name: 'xgboost_whl'

View File

@@ -93,14 +93,11 @@ mypy:
cd python-package; \
mypy ./xgboost/dask.py && \
mypy ./xgboost/rabit.py && \
mypy ./xgboost/tracker.py && \
mypy ./xgboost/sklearn.py && \
mypy ../demo/guide-python/external_memory.py && \
mypy ../demo/guide-python/categorical.py && \
mypy ../demo/guide-python/cat_in_the_dat.py && \
mypy ../tests/python-gpu/test_gpu_with_dask.py && \
mypy ../tests/python/test_data_iterator.py && \
mypy ../tests/python-gpu/test_gpu_data_iterator.py || exit 1; \
mypy ../tests/python-gpu/test_gpu_data_iterator.py && \
mypy ./xgboost/sklearn.py || exit 1; \
mypy . || true ;
clean:
@@ -153,7 +150,6 @@ Rpack: clean_all
bash R-package/remove_warning_suppression_pragma.sh
bash xgboost/remove_warning_suppression_pragma.sh
rm xgboost/remove_warning_suppression_pragma.sh
rm xgboost/CMakeLists.txt
rm -rfv xgboost/tests/helper_scripts/
R ?= R

235
NEWS.md
View File

@@ -3,241 +3,6 @@ XGBoost Change Log
This file records the changes in xgboost library in reverse chronological order.
## v1.5.0 (2021 Oct 11)
This release comes with many exciting new features and optimizations, along with some bug
fixes. We will describe the experimental categorical data support and the external memory
interface independently. Package-specific new features will be listed in respective
sections.
### Development on categorical data support
In version 1.3, XGBoost introduced an experimental feature for handling categorical data
natively, without one-hot encoding. XGBoost can fit categorical splits in decision
trees. (Currently, the generated splits will be of form `x \in {v}`, where the input is
compared to a single category value. A future version of XGBoost will generate splits that
compare the input against a list of multiple category values.)
Most of the other features, including prediction, SHAP value computation, feature
importance, and model plotting were revised to natively handle categorical splits. Also,
all Python interfaces including native interface with and without quantized `DMatrix`,
scikit-learn interface, and Dask interface now accept categorical data with a wide range
of data structures support including numpy/cupy array and cuDF/pandas/modin dataframe. In
practice, the following are required for enabling categorical data support during
training:
- Use Python package.
- Use `gpu_hist` to train the model.
- Use JSON model file format for saving the model.
Once the model is trained, it can be used with most of the features that are available on
the Python package. For a quick introduction, see
https://xgboost.readthedocs.io/en/latest/tutorials/categorical.html
Related PRs: (#7011, #7001, #7042, #7041, #7047, #7043, #7036, #7054, #7053, #7065, #7213, #7228, #7220, #7221, #7231, #7306)
* Next steps
- Revise the CPU training algorithm to handle categorical data natively and generate categorical splits
- Extend the CPU and GPU algorithms to generate categorical splits of form `x \in S`
where the input is compared with multiple category values. split. (#7081)
### External memory
This release features a brand-new interface and implementation for external memory (also
known as out-of-core training). (#6901, #7064, #7088, #7089, #7087, #7092, #7070,
#7216). The new implementation leverages the data iterator interface, which is currently
used to create `DeviceQuantileDMatrix`. For a quick introduction, see
https://xgboost.readthedocs.io/en/latest/tutorials/external_memory.html#data-iterator
. During the development of this new interface, `lz4` compression is removed. (#7076).
Please note that external memory support is still experimental and not ready for
production use yet. All future development will focus on this new interface and users are
advised to migrate. (You are using the old interface if you are using a URL suffix to use
external memory.)
### New features in Python package
* Support numpy array interface and all numeric types from numpy in `DMatrix`
construction and `inplace_predict` (#6998, #7003). Now XGBoost no longer makes data
copy when input is numpy array view.
* The early stopping callback in Python has a new `min_delta` parameter to control the
stopping behavior (#7137)
* Python package now supports calculating feature scores for the linear model, which is
also available on R package. (#7048)
* Python interface now supports configuring constraints using feature names instead of
feature indices.
* Typehint support for more Python code including scikit-learn interface and rabit
module. (#6799, #7240)
* Add tutorial for XGBoost-Ray (#6884)
### New features in R package
* In 1.4 we have a new prediction function in the C API which is used by the Python
package. This release revises the R package to use the new prediction function as well.
A new parameter `iteration_range` for the predict function is available, which can be
used for specifying the range of trees for running prediction. (#6819, #7126)
* R package now supports the `nthread` parameter in `DMatrix` construction. (#7127)
### New features in JVM packages
* Support GPU dataframe and `DeviceQuantileDMatrix` (#7195). Constructing `DMatrix`
with GPU data structures and the interface for quantized `DMatrix` were first
introduced in the Python package and are now available in the xgboost4j package.
* JVM packages now support saving and getting early stopping attributes. (#7095) Here is a
quick [example](https://github.com/dmlc/xgboost/jvm-packages/xgboost4j-example/src/main/java/ml/dmlc/xgboost4j/java/example/EarlyStopping.java "example") in JAVA (#7252).
### General new features
* We now have a pre-built binary package for R on Windows with GPU support. (#7185)
* CUDA compute capability 86 is now part of the default CMake build configuration with
newly added support for CUDA 11.4. (#7131, #7182, #7254)
* XGBoost can be compiled using system CUB provided by CUDA 11.x installation. (#7232)
### Optimizations
The performance for both `hist` and `gpu_hist` has been significantly improved in 1.5
with the following optimizations:
* GPU multi-class model training now supports prediction cache. (#6860)
* GPU histogram building is sped up and the overall training time is 2-3 times faster on
large datasets (#7180, #7198). In addition, we removed the parameter `deterministic_histogram` and now
the GPU algorithm is always deterministic.
* CPU hist has an optimized procedure for data sampling (#6922)
* More performance optimization in regression and binary classification objectives on
CPU (#7206)
* Tree model dump is now performed in parallel (#7040)
### Breaking changes
* `n_gpus` was deprecated in 1.0 release and is now removed.
* Feature grouping in CPU hist tree method is removed, which was disabled long
ago. (#7018)
* C API for Quantile DMatrix is changed to be consistent with the new external memory
implementation. (#7082)
### Notable general bug fixes
* XGBoost no long changes global CUDA device ordinal when `gpu_id` is specified (#6891,
#6987)
* Fix `gamma` negative likelihood evaluation metric. (#7275)
* Fix integer value of `verbose_eal` for `xgboost.cv` function in Python. (#7291)
* Remove extra sync in CPU hist for dense data, which can lead to incorrect tree node
statistics. (#7120, #7128)
* Fix a bug in GPU hist when data size is larger than `UINT32_MAX` with missing
values. (#7026)
* Fix a thread safety issue in prediction with the `softmax` objective. (#7104)
* Fix a thread safety issue in CPU SHAP value computation. (#7050) Please note that all
prediction functions in Python are thread-safe.
* Fix model slicing. (#7149, #7078)
* Workaround a bug in old GCC which can lead to segfault during construction of
DMatrix. (#7161)
* Fix histogram truncation in GPU hist, which can lead to slightly-off results. (#7181)
* Fix loading GPU linear model pickle files on CPU-only machine. (#7154)
* Check input value is duplicated when CPU quantile queue is full (#7091)
* Fix parameter loading with training continuation. (#7121)
* Fix CMake interface for exposing C library by specifying dependencies. (#7099)
* Callback and early stopping are explicitly disabled for the scikit-learn interface
random forest estimator. (#7236)
* Fix compilation error on x86 (32-bit machine) (#6964)
* Fix CPU memory usage with extremely sparse datasets (#7255)
* Fix a bug in GPU multi-class AUC implementation with weighted data (#7300)
### Python package
Other than the items mentioned in the previous sections, there are some Python-specific
improvements.
* Change development release postfix to `dev` (#6988)
* Fix early stopping behavior with MAPE metric (#7061)
* Fixed incorrect feature mismatch error message (#6949)
* Add predictor to skl constructor. (#7000, #7159)
* Re-enable feature validation in predict proba. (#7177)
* scikit learn interface regression estimator now can pass the scikit-learn estimator
check and is fully compatible with scikit-learn utilities. `__sklearn_is_fitted__` is
implemented as part of the changes (#7130, #7230)
* Conform the latest pylint. (#7071, #7241)
* Support latest panda range index in DMatrix construction. (#7074)
* Fix DMatrix construction from pandas series. (#7243)
* Fix typo and grammatical mistake in error message (#7134)
* [dask] disable work stealing explicitly for training tasks (#6794)
* [dask] Set dataframe index in predict. (#6944)
* [dask] Fix prediction on df with latest dask. (#6969)
* [dask] Fix dask predict on `DaskDMatrix` with `iteration_range`. (#7005)
* [dask] Disallow importing non-dask estimators from xgboost.dask (#7133)
### R package
Improvements other than new features on R package:
* Optimization for updating R handles in-place (#6903)
* Removed the magrittr dependency. (#6855, #6906, #6928)
* The R package now hides all C++ symbols to avoid conflicts. (#7245)
* Other maintenance including code cleanups, document updates. (#6863, #6915, #6930, #6966, #6967)
### JVM packages
Improvements other than new features on JVM packages:
* Constructors with implicit missing value are deprecated due to confusing behaviors. (#7225)
* Reduce scala-compiler, scalatest dependency scopes (#6730)
* Making the Java library loader emit helpful error messages on missing dependencies. (#6926)
* JVM packages now use the Python tracker in XGBoost instead of dmlc. The one in XGBoost
is shared between JVM packages and Python Dask and enjoys better maintenance (#7132)
* Fix "key not found: train" error (#6842)
* Fix model loading from stream (#7067)
### General document improvements
* Overhaul the installation documents. (#6877)
* A few demos are added for AFT with dask (#6853), callback with dask (#6995), inference
in C (#7151), `process_type`. (#7135)
* Fix PDF format of document. (#7143)
* Clarify the behavior of `use_rmm`. (#6808)
* Clarify prediction function. (#6813)
* Improve tutorial on feature interactions (#7219)
* Add small example for dask sklearn interface. (#6970)
* Update Python intro. (#7235)
* Some fixes/updates (#6810, #6856, #6935, #6948, #6976, #7084, #7097, #7170, #7173, #7174, #7226, #6979, #6809, #6796, #6979)
### Maintenance
* Some refactoring around CPU hist, which lead to better performance but are listed under general maintenance tasks:
- Extract evaluate splits from CPU hist. (#7079)
- Merge lossgude and depthwise strategies for CPU hist (#7007)
- Simplify sparse and dense CPU hist kernels (#7029)
- Extract histogram builder from CPU Hist. (#7152)
* Others
- Fix `gpu_id` with custom objective. (#7015)
- Fix typos in AUC. (#6795)
- Use constexpr in `dh::CopyIf`. (#6828)
- Update dmlc-core. (#6862)
- Bump version to 1.5.0 snapshot in master. (#6875)
- Relax shotgun test. (#6900)
- Guard against index error in prediction. (#6982)
- Hide symbols in CI build + hide symbols for C and CUDA (#6798)
- Persist data in dask test. (#7077)
- Fix typo in arguments of PartitionBuilder::Init (#7113)
- Fix typo in src/common/hist.cc BuildHistKernel (#7116)
- Use upstream URI in distributed quantile tests. (#7129)
- Include cpack (#7160)
- Remove synchronization in monitor. (#7164)
- Remove unused code. (#7175)
- Fix building on CUDA 11.0. (#7187)
- Better error message for `ncclUnhandledCudaError`. (#7190)
- Add noexcept to JSON objects. (#7205)
- Improve wording for warning (#7248)
- Fix typo in release script. [skip ci] (#7238)
- Relax shotgun test. (#6918)
- Relax test for decision stump in distributed environment. (#6919)
- [dask] speed up tests (#7020)
### CI
* [CI] Rotate access keys for uploading MacOS artifacts from Travis CI (#7253)
* Reduce Travis environment setup time. (#6912)
* Restore R cache on github action. (#6985)
* [CI] Remove stray build artifact to avoid error in artifact packaging (#6994)
* [CI] Move appveyor tests to action (#6986)
* Remove appveyor badge. [skip ci] (#7035)
* [CI] Configure RAPIDS, dask, modin (#7033)
* Test on s390x. (#7038)
* [CI] Upgrade to CMake 3.14 (#7060)
* [CI] Update R cache. (#7102)
* [CI] Pin libomp to 11.1.0 (#7107)
* [CI] Upgrade build image to CentOS 7 + GCC 8; require CUDA 10.1 and later (#7141)
* [dask] Work around segfault in prediction. (#7112)
* [dask] Remove the workaround for segfault. (#7146)
* [CI] Fix hanging Python setup in Windows CI (#7186)
* [CI] Clean up in beginning of each task in Win CI (#7189)
* Fix travis. (#7237)
### Acknowledgement
* **Contributors**: Adam Pocock (@Craigacp), Jeff H (@JeffHCross), Johan Hansson (@JohanWork), Jose Manuel Llorens (@JoseLlorensRipolles), Benjamin Szőke (@Livius90), @ReeceGoding, @ShvetsKS, Robert Zabel (@ZabelTech), Ali (@ali5h), Andrew Ziem (@az0), Andy Adinets (@canonizer), @david-cortes, Daniel Saxton (@dsaxton), Emil Sadek (@esadek), @farfarawayzyt, Gil Forsyth (@gforsyth), @giladmaya, @graue70, Philip Hyunsu Cho (@hcho3), James Lamb (@jameslamb), José Morales (@jmoralez), Kai Fricke (@krfricke), Christian Lorentzen (@lorentzenchr), Mads R. B. Kristensen (@madsbk), Anton Kostin (@masguit42), Martin Petříček (@mpetricek-corp), @naveenkb, Taewoo Kim (@oOTWK), Viktor Szathmáry (@phraktle), Robert Maynard (@robertmaynard), TP Boudreau (@tpboudreau), Jiaming Yuan (@trivialfis), Paul Taylor (@trxcllnt), @vslaykovsky, Bobby Wang (@wbo4958),
* **Reviewers**: Nan Zhu (@CodingCat), Adam Pocock (@Craigacp), Jose Manuel Llorens (@JoseLlorensRipolles), Kodi Arfer (@Kodiologist), Benjamin Szőke (@Livius90), Mark Guryanov (@MarkGuryanov), Rory Mitchell (@RAMitchell), @ReeceGoding, @ShvetsKS, Egor Smirnov (@SmirnovEgorRu), Andrew Ziem (@az0), @candalfigomoro, Andy Adinets (@canonizer), Dante Gama Dessavre (@dantegd), @david-cortes, Daniel Saxton (@dsaxton), @farfarawayzyt, Gil Forsyth (@gforsyth), Harutaka Kawamura (@harupy), Philip Hyunsu Cho (@hcho3), @jakirkham, James Lamb (@jameslamb), José Morales (@jmoralez), James Bourbeau (@jrbourbeau), Christian Lorentzen (@lorentzenchr), Martin Petříček (@mpetricek-corp), Nikolay Petrov (@napetrov), @naveenkb, Viktor Szathmáry (@phraktle), Robin Teuwens (@rteuwens), Yuan Tang (@terrytangyuan), TP Boudreau (@tpboudreau), Jiaming Yuan (@trivialfis), @vkuzmin-uber, Bobby Wang (@wbo4958), William Hicks (@wphicks)
## v1.4.2 (2021.05.13)
This is a patch release for Python package with following fixes:

View File

@@ -1,12 +1,12 @@
Package: xgboost
Type: Package
Title: Extreme Gradient Boosting
Version: 1.6.0.1
Date: 2022-03-29
Version: 1.5.1.1
Date: 2021-10-13
Authors@R: c(
person("Tianqi", "Chen", role = c("aut"),
email = "tianqi.tchen@gmail.com"),
person("Tong", "He", role = c("aut"),
person("Tong", "He", role = c("aut", "cre"),
email = "hetong007@gmail.com"),
person("Michael", "Benesty", role = c("aut"),
email = "michael@benesty.fr"),
@@ -26,12 +26,9 @@ Authors@R: c(
person("Min", "Lin", role = c("aut")),
person("Yifeng", "Geng", role = c("aut")),
person("Yutian", "Li", role = c("aut")),
person("Jiaming", "Yuan", role = c("aut", "cre"),
email = "jm.yuan@outlook.com"),
person("XGBoost contributors", role = c("cph"),
comment = "base XGBoost implementation")
)
Maintainer: Jiaming Yuan <jm.yuan@outlook.com>
Description: Extreme Gradient Boosting, which is an efficient implementation
of the gradient boosting framework from Chen & Guestrin (2016) <doi:10.1145/2939672.2939785>.
This package is its R interface. The package includes efficient linear

View File

@@ -162,11 +162,7 @@ xgb.Booster.complete <- function(object, saveraw = TRUE) {
#' Predicted values based on either xgboost model or model handle object.
#'
#' @param object Object of class \code{xgb.Booster} or \code{xgb.Booster.handle}
#' @param newdata takes \code{matrix}, \code{dgCMatrix}, \code{dgRMatrix}, \code{dsparseVector},
#' local data file or \code{xgb.DMatrix}.
#'
#' For single-row predictions on sparse data, it's recommended to use CSR format. If passing
#' a sparse vector, it will take it as a row vector.
#' @param newdata takes \code{matrix}, \code{dgCMatrix}, local data file or \code{xgb.DMatrix}.
#' @param missing Missing is only used when input is dense matrix. Pick a float value that represents
#' missing values in data (e.g., sometimes 0 or some other extreme value is used).
#' @param outputmargin whether the prediction should be returned in the for of original untransformed
@@ -184,7 +180,7 @@ xgb.Booster.complete <- function(object, saveraw = TRUE) {
#' training predicting will perform dropout.
#' @param iterationrange Specifies which layer of trees are used in prediction. For
#' example, if a random forest is trained with 100 rounds. Specifying
#' `iterationrange=(1, 21)`, then only the forests built during [1, 21) (half open set)
#' `iteration_range=(1, 21)`, then only the forests built during [1, 21) (half open set)
#' rounds are used in this prediction. It's 1-based index just like R vector. When set
#' to \code{c(1, 1)} XGBoost will use all trees.
#' @param strict_shape Default is \code{FALSE}. When it's set to \code{TRUE}, output
@@ -439,8 +435,7 @@ predict.xgb.Booster <- function(object, newdata, missing = NA, outputmargin = FA
lapply(seq_len(n_groups), function(g) arr[g, , ])
} else {
## remove the first axis (group)
dn <- dimnames(arr)
matrix(arr[1, , ], nrow = dim(arr)[2], ncol = dim(arr)[3], dimnames = c(dn[2], dn[3]))
as.matrix(arr[1, , ])
}
} else if (predinteraction) {
## Predict interaction
@@ -452,8 +447,7 @@ predict.xgb.Booster <- function(object, newdata, missing = NA, outputmargin = FA
lapply(seq_len(n_groups), function(g) arr[g, , , ])
} else {
## remove the first axis (group)
arr <- arr[1, , , , drop = FALSE]
array(arr, dim = dim(arr)[2:4], dimnames(arr)[2:4])
arr[1, , , ]
}
} else {
## Normal prediction

View File

@@ -4,10 +4,8 @@
#' Supported input file formats are either a LIBSVM text file or a binary file that was created previously by
#' \code{\link{xgb.DMatrix.save}}).
#'
#' @param data a \code{matrix} object (either numeric or integer), a \code{dgCMatrix} object,
#' a \code{dgRMatrix} object (only when making predictions from a fitted model),
#' a \code{dsparseVector} object (only when making predictions from a fitted model, will be
#' interpreted as a row vector), or a character string representing a filename.
#' @param data a \code{matrix} object (either numeric or integer), a \code{dgCMatrix} object, or a character
#' string representing a filename.
#' @param info a named list of additional information to store in the \code{xgb.DMatrix} object.
#' See \code{\link{setinfo}} for the specific allowed kinds of
#' @param missing a float value to represents missing values in data (used only when input is a dense matrix).
@@ -35,21 +33,8 @@ xgb.DMatrix <- function(data, info = list(), missing = NA, silent = FALSE, nthre
handle <- .Call(XGDMatrixCreateFromMat_R, data, missing, as.integer(NVL(nthread, -1)))
cnames <- colnames(data)
} else if (inherits(data, "dgCMatrix")) {
handle <- .Call(
XGDMatrixCreateFromCSC_R, data@p, data@i, data@x, nrow(data), as.integer(NVL(nthread, -1))
)
handle <- .Call(XGDMatrixCreateFromCSC_R, data@p, data@i, data@x, nrow(data))
cnames <- colnames(data)
} else if (inherits(data, "dgRMatrix")) {
handle <- .Call(
XGDMatrixCreateFromCSR_R, data@p, data@j, data@x, ncol(data), as.integer(NVL(nthread, -1))
)
cnames <- colnames(data)
} else if (inherits(data, "dsparseVector")) {
indptr <- c(0L, as.integer(length(data@i)))
ind <- as.integer(data@i) - 1L
handle <- .Call(
XGDMatrixCreateFromCSR_R, indptr, ind, data@x, length(data), as.integer(NVL(nthread, -1))
)
} else {
stop("xgb.DMatrix does not support construction from ", typeof(data))
}
@@ -287,13 +272,6 @@ setinfo.xgb.DMatrix <- function(object, name, info, ...) {
.Call(XGDMatrixSetInfo_R, object, name, as.integer(info))
return(TRUE)
}
if (name == "feature_weights") {
if (length(info) != ncol(object)) {
stop("The number of feature weights must equal to the number of columns in the input data")
}
.Call(XGDMatrixSetInfo_R, object, name, as.numeric(info))
return(TRUE)
}
stop("setinfo: unknown info name ", name)
return(FALSE)
}

View File

@@ -18,7 +18,7 @@
#'
#' International Workshop on Data Mining for Online Advertising (ADKDD) - August 24, 2014
#'
#' \url{https://research.facebook.com/publications/practical-lessons-from-predicting-clicks-on-ads-at-facebook/}.
#' \url{https://research.fb.com/publications/practical-lessons-from-predicting-clicks-on-ads-at-facebook/}.
#'
#' Extract explaining the method:
#'

View File

@@ -6,6 +6,8 @@
#' @param fname the name of the text file where to save the model text dump.
#' If not provided or set to \code{NULL}, the model is returned as a \code{character} vector.
#' @param fmap feature map file representing feature types.
#' Detailed description could be found at
#' \url{https://github.com/dmlc/xgboost/wiki/Binary-Classification#dump-model}.
#' See demo/ for walkthrough example in R, and
#' \url{https://github.com/dmlc/xgboost/blob/master/demo/data/featmap.txt}
#' for example Format.

View File

@@ -5,7 +5,7 @@
#' @param modelfile the name of the binary input file.
#'
#' @details
#' The input file is expected to contain a model saved in an xgboost model format
#' The input file is expected to contain a model saved in an xgboost-internal binary format
#' using either \code{\link{xgb.save}} or \code{\link{cb.save.model}} in R, or using some
#' appropriate methods from other xgboost interfaces. E.g., a model trained in Python and
#' saved from there in xgboost format, could be loaded from R.
@@ -38,13 +38,6 @@ xgb.load <- function(modelfile) {
handle <- xgb.Booster.handle(modelfile = modelfile)
# re-use modelfile if it is raw so we do not need to serialize
if (typeof(modelfile) == "raw") {
warning(
paste(
"The support for loading raw booster with `xgb.load` will be ",
"discontinued in upcoming release. Use `xgb.load.raw` or",
" `xgb.unserialize` instead. "
)
)
bst <- xgb.handleToBooster(handle, modelfile)
} else {
bst <- xgb.handleToBooster(handle, NULL)

View File

@@ -3,21 +3,12 @@
#' User can generate raw memory buffer by calling xgb.save.raw
#'
#' @param buffer the buffer returned by xgb.save.raw
#' @param as_booster Return the loaded model as xgb.Booster instead of xgb.Booster.handle.
#'
#' @export
xgb.load.raw <- function(buffer, as_booster = FALSE) {
xgb.load.raw <- function(buffer) {
cachelist <- list()
handle <- .Call(XGBoosterCreate_R, cachelist)
.Call(XGBoosterLoadModelFromRaw_R, handle, buffer)
class(handle) <- "xgb.Booster.handle"
if (as_booster) {
booster <- list(handle = handle, raw = NULL)
class(booster) <- "xgb.Booster"
booster <- xgb.Booster.complete(booster, saveraw = TRUE)
return(booster)
} else {
return (handle)
}
return (handle)
}

View File

@@ -87,7 +87,7 @@ xgb.model.dt.tree <- function(feature_names = NULL, model = NULL, text = NULL,
}
if (length(text) < 2 ||
sum(grepl('leaf=(\\d+)', text)) < 1) {
sum(grepl('yes=(\\d+),no=(\\d+)', text)) < 1) {
stop("Non-tree model detected! This function can only be used with tree models.")
}
@@ -116,28 +116,16 @@ xgb.model.dt.tree <- function(feature_names = NULL, model = NULL, text = NULL,
branch_rx <- paste0("f(\\d+)<(", anynumber_regex, ")\\] yes=(\\d+),no=(\\d+),missing=(\\d+),",
"gain=(", anynumber_regex, "),cover=(", anynumber_regex, ")")
branch_cols <- c("Feature", "Split", "Yes", "No", "Missing", "Quality", "Cover")
td[
isLeaf == FALSE,
(branch_cols) := {
matches <- regmatches(t, regexec(branch_rx, t))
# skip some indices with spurious capture groups from anynumber_regex
xtr <- do.call(rbind, matches)[, c(2, 3, 5, 6, 7, 8, 10), drop = FALSE]
xtr[, 3:5] <- add.tree.id(xtr[, 3:5], Tree)
if (length(xtr) == 0) {
as.data.table(
list(Feature = "NA", Split = "NA", Yes = "NA", No = "NA", Missing = "NA", Quality = "NA", Cover = "NA")
)
} else {
as.data.table(xtr)
}
}
]
td[isLeaf == FALSE,
(branch_cols) := {
matches <- regmatches(t, regexec(branch_rx, t))
# skip some indices with spurious capture groups from anynumber_regex
xtr <- do.call(rbind, matches)[, c(2, 3, 5, 6, 7, 8, 10), drop = FALSE]
xtr[, 3:5] <- add.tree.id(xtr[, 3:5], Tree)
as.data.table(xtr)
}]
# assign feature_names when available
is_stump <- function() {
return(length(td$Feature) == 1 && is.na(td$Feature))
}
if (!is.null(feature_names) && !is_stump()) {
if (!is.null(feature_names)) {
if (length(feature_names) <= max(as.numeric(td$Feature), na.rm = TRUE))
stop("feature_names has less elements than there are features used in the model")
td[isLeaf == FALSE, Feature := feature_names[as.numeric(Feature) + 1]]
@@ -146,18 +134,12 @@ xgb.model.dt.tree <- function(feature_names = NULL, model = NULL, text = NULL,
# parse leaf lines
leaf_rx <- paste0("leaf=(", anynumber_regex, "),cover=(", anynumber_regex, ")")
leaf_cols <- c("Feature", "Quality", "Cover")
td[
isLeaf == TRUE,
(leaf_cols) := {
matches <- regmatches(t, regexec(leaf_rx, t))
xtr <- do.call(rbind, matches)[, c(2, 4)]
if (length(xtr) == 2) {
c("Leaf", as.data.table(xtr[1]), as.data.table(xtr[2]))
} else {
c("Leaf", as.data.table(xtr))
}
}
]
td[isLeaf == TRUE,
(leaf_cols) := {
matches <- regmatches(t, regexec(leaf_rx, t))
xtr <- do.call(rbind, matches)[, c(2, 4)]
c("Leaf", as.data.table(xtr))
}]
# convert some columns to numeric
numeric_cols <- c("Split", "Quality", "Cover")

View File

@@ -98,22 +98,18 @@ xgb.plot.tree <- function(feature_names = NULL, model = NULL, trees = NULL, plot
data = dt$Feature,
fontcolor = "black")
if (nrow(dt[Feature != "Leaf"]) != 0) {
edges <- DiagrammeR::create_edge_df(
from = match(rep(dt[Feature != "Leaf", c(ID)], 2), dt$ID),
to = match(dt[Feature != "Leaf", c(Yes, No)], dt$ID),
label = c(
dt[Feature != "Leaf", paste("<", Split)],
rep("", nrow(dt[Feature != "Leaf"]))
),
style = c(
dt[Feature != "Leaf", ifelse(Missing == Yes, "bold", "solid")],
dt[Feature != "Leaf", ifelse(Missing == No, "bold", "solid")]
),
rel = "leading_to")
} else {
edges <- NULL
}
edges <- DiagrammeR::create_edge_df(
from = match(rep(dt[Feature != "Leaf", c(ID)], 2), dt$ID),
to = match(dt[Feature != "Leaf", c(Yes, No)], dt$ID),
label = c(
dt[Feature != "Leaf", paste("<", Split)],
rep("", nrow(dt[Feature != "Leaf"]))
),
style = c(
dt[Feature != "Leaf", ifelse(Missing == Yes, "bold", "solid")],
dt[Feature != "Leaf", ifelse(Missing == No, "bold", "solid")]
),
rel = "leading_to")
graph <- DiagrammeR::create_graph(
nodes_df = nodes,

View File

@@ -4,14 +4,6 @@
#' Save xgboost model from xgboost or xgb.train
#'
#' @param model the model object.
#' @param raw_format The format for encoding the booster. Available options are
#' \itemize{
#' \item \code{json}: Encode the booster into JSON text document.
#' \item \code{ubj}: Encode the booster into Universal Binary JSON.
#' \item \code{deprecated}: Encode the booster into old customized binary format.
#' }
#'
#' Right now the default is \code{deprecated} but will be changed to \code{ubj} in upcoming release.
#'
#' @examples
#' data(agaricus.train, package='xgboost')
@@ -25,8 +17,7 @@
#' pred <- predict(bst, test$data)
#'
#' @export
xgb.save.raw <- function(model, raw_format = "deprecated") {
xgb.save.raw <- function(model) {
handle <- xgb.get.handle(model)
args <- list(format = raw_format)
.Call(XGBoosterSaveModelToRaw_R, handle, jsonlite::toJSON(args, auto_unbox = TRUE))
.Call(XGBoosterModelToRaw_R, handle)
}

21
R-package/configure vendored
View File

@@ -1,6 +1,6 @@
#! /bin/sh
# Guess values for system-dependent variables and create Makefiles.
# Generated by GNU Autoconf 2.69 for xgboost 1.6-0.
# Generated by GNU Autoconf 2.69 for xgboost 0.6-3.
#
#
# Copyright (C) 1992-1996, 1998-2012 Free Software Foundation, Inc.
@@ -576,8 +576,8 @@ MAKEFLAGS=
# Identity of this package.
PACKAGE_NAME='xgboost'
PACKAGE_TARNAME='xgboost'
PACKAGE_VERSION='1.6-0'
PACKAGE_STRING='xgboost 1.6-0'
PACKAGE_VERSION='0.6-3'
PACKAGE_STRING='xgboost 0.6-3'
PACKAGE_BUGREPORT=''
PACKAGE_URL=''
@@ -1195,7 +1195,7 @@ if test "$ac_init_help" = "long"; then
# Omit some internal or obsolete options to make the list less imposing.
# This message is too long to be a string in the A/UX 3.1 sh.
cat <<_ACEOF
\`configure' configures xgboost 1.6-0 to adapt to many kinds of systems.
\`configure' configures xgboost 0.6-3 to adapt to many kinds of systems.
Usage: $0 [OPTION]... [VAR=VALUE]...
@@ -1257,7 +1257,7 @@ fi
if test -n "$ac_init_help"; then
case $ac_init_help in
short | recursive ) echo "Configuration of xgboost 1.6-0:";;
short | recursive ) echo "Configuration of xgboost 0.6-3:";;
esac
cat <<\_ACEOF
@@ -1336,7 +1336,7 @@ fi
test -n "$ac_init_help" && exit $ac_status
if $ac_init_version; then
cat <<\_ACEOF
xgboost configure 1.6-0
xgboost configure 0.6-3
generated by GNU Autoconf 2.69
Copyright (C) 2012 Free Software Foundation, Inc.
@@ -1479,7 +1479,7 @@ cat >config.log <<_ACEOF
This file contains any messages produced by compilers while
running configure, to aid debugging if configure makes a mistake.
It was created by xgboost $as_me 1.6-0, which was
It was created by xgboost $as_me 0.6-3, which was
generated by GNU Autoconf 2.69. Invocation command line was
$ $0 $@
@@ -2725,7 +2725,7 @@ main ()
return 0;
}
_ACEOF
${CC} -o conftest conftest.c ${CPPFLAGS} ${LDFLAGS} ${OPENMP_LIB} ${OPENMP_CXXFLAGS} 2>/dev/null && ./conftest && ac_pkg_openmp=yes
${CC} -o conftest conftest.c ${OPENMP_LIB} ${OPENMP_CXXFLAGS} 2>/dev/null && ./conftest && ac_pkg_openmp=yes
{ $as_echo "$as_me:${as_lineno-$LINENO}: result: ${ac_pkg_openmp}" >&5
$as_echo "${ac_pkg_openmp}" >&6; }
if test "${ac_pkg_openmp}" = no; then
@@ -3287,7 +3287,7 @@ cat >>$CONFIG_STATUS <<\_ACEOF || ac_write_fail=1
# report actual input values of CONFIG_FILES etc. instead of their
# values after options handling.
ac_log="
This file was extended by xgboost $as_me 1.6-0, which was
This file was extended by xgboost $as_me 0.6-3, which was
generated by GNU Autoconf 2.69. Invocation command line was
CONFIG_FILES = $CONFIG_FILES
@@ -3340,7 +3340,7 @@ _ACEOF
cat >>$CONFIG_STATUS <<_ACEOF || ac_write_fail=1
ac_cs_config="`$as_echo "$ac_configure_args" | sed 's/^ //; s/[\\""\`\$]/\\\\&/g'`"
ac_cs_version="\\
xgboost config.status 1.6-0
xgboost config.status 0.6-3
configured by $0, generated by GNU Autoconf 2.69,
with options \\"\$ac_cs_config\\"
@@ -3900,3 +3900,4 @@ if test -n "$ac_unrecognized_opts" && test "$enable_option_checking" != no; then
$as_echo "$as_me: WARNING: unrecognized options: $ac_unrecognized_opts" >&2;}
fi

View File

@@ -2,7 +2,7 @@
AC_PREREQ(2.69)
AC_INIT([xgboost],[1.6-0],[],[xgboost],[])
AC_INIT([xgboost],[0.6-3],[],[xgboost],[])
# Use this line to set CC variable to a C compiler
AC_PROG_CC
@@ -33,7 +33,7 @@ then
ac_pkg_openmp=no
AC_MSG_CHECKING([whether OpenMP will work in a package])
AC_LANG_CONFTEST([AC_LANG_PROGRAM([[#include <omp.h>]], [[ return (omp_get_max_threads() <= 1); ]])])
${CC} -o conftest conftest.c ${CPPFLAGS} ${LDFLAGS} ${OPENMP_LIB} ${OPENMP_CXXFLAGS} 2>/dev/null && ./conftest && ac_pkg_openmp=yes
${CC} -o conftest conftest.c ${OPENMP_LIB} ${OPENMP_CXXFLAGS} 2>/dev/null && ./conftest && ac_pkg_openmp=yes
AC_MSG_RESULT([${ac_pkg_openmp}])
if test "${ac_pkg_openmp}" = no; then
OPENMP_CXXFLAGS=''

View File

@@ -63,7 +63,7 @@ print(paste("sum(abs(pred2-pred))=", sum(abs(pred2 - pred))))
# save model to R's raw vector
raw <- xgb.save.raw(bst)
# load binary model to R
bst3 <- xgb.load.raw(raw)
bst3 <- xgb.load(raw)
pred3 <- predict(bst3, test$data)
# pred3 should be identical to pred
print(paste("sum(abs(pred3-pred))=", sum(abs(pred3 - pred))))

View File

@@ -27,11 +27,7 @@
\arguments{
\item{object}{Object of class \code{xgb.Booster} or \code{xgb.Booster.handle}}
\item{newdata}{takes \code{matrix}, \code{dgCMatrix}, \code{dgRMatrix}, \code{dsparseVector},
local data file or \code{xgb.DMatrix}.
For single-row predictions on sparse data, it's recommended to use CSR format. If passing
a sparse vector, it will take it as a row vector.}
\item{newdata}{takes \code{matrix}, \code{dgCMatrix}, local data file or \code{xgb.DMatrix}.}
\item{missing}{Missing is only used when input is dense matrix. Pick a float value that represents
missing values in data (e.g., sometimes 0 or some other extreme value is used).}
@@ -59,7 +55,7 @@ training predicting will perform dropout.}
\item{iterationrange}{Specifies which layer of trees are used in prediction. For
example, if a random forest is trained with 100 rounds. Specifying
`iterationrange=(1, 21)`, then only the forests built during [1, 21) (half open set)
`iteration_range=(1, 21)`, then only the forests built during [1, 21) (half open set)
rounds are used in this prediction. It's 1-based index just like R vector. When set
to \code{c(1, 1)} XGBoost will use all trees.}

View File

@@ -14,10 +14,8 @@ xgb.DMatrix(
)
}
\arguments{
\item{data}{a \code{matrix} object (either numeric or integer), a \code{dgCMatrix} object,
a \code{dgRMatrix} object (only when making predictions from a fitted model),
a \code{dsparseVector} object (only when making predictions from a fitted model, will be
interpreted as a row vector), or a character string representing a filename.}
\item{data}{a \code{matrix} object (either numeric or integer), a \code{dgCMatrix} object, or a character
string representing a filename.}
\item{info}{a named list of additional information to store in the \code{xgb.DMatrix} object.
See \code{\link{setinfo}} for the specific allowed kinds of}

View File

@@ -29,7 +29,7 @@ Joaquin Quinonero Candela)}
International Workshop on Data Mining for Online Advertising (ADKDD) - August 24, 2014
\url{https://research.facebook.com/publications/practical-lessons-from-predicting-clicks-on-ads-at-facebook/}.
\url{https://research.fb.com/publications/practical-lessons-from-predicting-clicks-on-ads-at-facebook/}.
Extract explaining the method:

View File

@@ -20,6 +20,8 @@ xgb.dump(
If not provided or set to \code{NULL}, the model is returned as a \code{character} vector.}
\item{fmap}{feature map file representing feature types.
Detailed description could be found at
\url{https://github.com/dmlc/xgboost/wiki/Binary-Classification#dump-model}.
See demo/ for walkthrough example in R, and
\url{https://github.com/dmlc/xgboost/blob/master/demo/data/featmap.txt}
for example Format.}

View File

@@ -16,7 +16,7 @@ An object of \code{xgb.Booster} class.
Load xgboost model from the binary model file.
}
\details{
The input file is expected to contain a model saved in an xgboost model format
The input file is expected to contain a model saved in an xgboost-internal binary format
using either \code{\link{xgb.save}} or \code{\link{cb.save.model}} in R, or using some
appropriate methods from other xgboost interfaces. E.g., a model trained in Python and
saved from there in xgboost format, could be loaded from R.

View File

@@ -4,12 +4,10 @@
\alias{xgb.load.raw}
\title{Load serialised xgboost model from R's raw vector}
\usage{
xgb.load.raw(buffer, as_booster = FALSE)
xgb.load.raw(buffer)
}
\arguments{
\item{buffer}{the buffer returned by xgb.save.raw}
\item{as_booster}{Return the loaded model as xgb.Booster instead of xgb.Booster.handle.}
}
\description{
User can generate raw memory buffer by calling xgb.save.raw

View File

@@ -5,19 +5,10 @@
\title{Save xgboost model to R's raw vector,
user can call xgb.load.raw to load the model back from raw vector}
\usage{
xgb.save.raw(model, raw_format = "deprecated")
xgb.save.raw(model)
}
\arguments{
\item{model}{the model object.}
\item{raw_format}{The format for encoding the booster. Available options are
\itemize{
\item \code{json}: Encode the booster into JSON text document.
\item \code{ubj}: Encode the booster into Universal Binary JSON.
\item \code{deprecated}: Encode the booster into old customized binary format.
}
Right now the default is \code{deprecated} but will be changed to \code{ubj} in upcoming release.}
}
\description{
Save xgboost model from xgboost or xgb.train

View File

@@ -24,12 +24,12 @@ extern SEXP XGBoosterEvalOneIter_R(SEXP, SEXP, SEXP, SEXP);
extern SEXP XGBoosterGetAttrNames_R(SEXP);
extern SEXP XGBoosterGetAttr_R(SEXP, SEXP);
extern SEXP XGBoosterLoadModelFromRaw_R(SEXP, SEXP);
extern SEXP XGBoosterSaveModelToRaw_R(SEXP handle, SEXP config);
extern SEXP XGBoosterLoadModel_R(SEXP, SEXP);
extern SEXP XGBoosterSaveJsonConfig_R(SEXP handle);
extern SEXP XGBoosterLoadJsonConfig_R(SEXP handle, SEXP value);
extern SEXP XGBoosterSerializeToBuffer_R(SEXP handle);
extern SEXP XGBoosterUnserializeFromBuffer_R(SEXP handle, SEXP raw);
extern SEXP XGBoosterModelToRaw_R(SEXP);
extern SEXP XGBoosterPredict_R(SEXP, SEXP, SEXP, SEXP, SEXP);
extern SEXP XGBoosterPredictFromDMatrix_R(SEXP, SEXP, SEXP);
extern SEXP XGBoosterSaveModel_R(SEXP, SEXP);
@@ -37,8 +37,7 @@ extern SEXP XGBoosterSetAttr_R(SEXP, SEXP, SEXP);
extern SEXP XGBoosterSetParam_R(SEXP, SEXP, SEXP);
extern SEXP XGBoosterUpdateOneIter_R(SEXP, SEXP, SEXP);
extern SEXP XGCheckNullPtr_R(SEXP);
extern SEXP XGDMatrixCreateFromCSC_R(SEXP, SEXP, SEXP, SEXP, SEXP);
extern SEXP XGDMatrixCreateFromCSR_R(SEXP, SEXP, SEXP, SEXP, SEXP);
extern SEXP XGDMatrixCreateFromCSC_R(SEXP, SEXP, SEXP, SEXP);
extern SEXP XGDMatrixCreateFromFile_R(SEXP, SEXP);
extern SEXP XGDMatrixCreateFromMat_R(SEXP, SEXP, SEXP);
extern SEXP XGDMatrixGetInfo_R(SEXP, SEXP);
@@ -60,12 +59,12 @@ static const R_CallMethodDef CallEntries[] = {
{"XGBoosterGetAttrNames_R", (DL_FUNC) &XGBoosterGetAttrNames_R, 1},
{"XGBoosterGetAttr_R", (DL_FUNC) &XGBoosterGetAttr_R, 2},
{"XGBoosterLoadModelFromRaw_R", (DL_FUNC) &XGBoosterLoadModelFromRaw_R, 2},
{"XGBoosterSaveModelToRaw_R", (DL_FUNC) &XGBoosterSaveModelToRaw_R, 2},
{"XGBoosterLoadModel_R", (DL_FUNC) &XGBoosterLoadModel_R, 2},
{"XGBoosterSaveJsonConfig_R", (DL_FUNC) &XGBoosterSaveJsonConfig_R, 1},
{"XGBoosterLoadJsonConfig_R", (DL_FUNC) &XGBoosterLoadJsonConfig_R, 2},
{"XGBoosterSerializeToBuffer_R", (DL_FUNC) &XGBoosterSerializeToBuffer_R, 1},
{"XGBoosterUnserializeFromBuffer_R", (DL_FUNC) &XGBoosterUnserializeFromBuffer_R, 2},
{"XGBoosterModelToRaw_R", (DL_FUNC) &XGBoosterModelToRaw_R, 1},
{"XGBoosterPredict_R", (DL_FUNC) &XGBoosterPredict_R, 5},
{"XGBoosterPredictFromDMatrix_R", (DL_FUNC) &XGBoosterPredictFromDMatrix_R, 3},
{"XGBoosterSaveModel_R", (DL_FUNC) &XGBoosterSaveModel_R, 2},
@@ -73,8 +72,7 @@ static const R_CallMethodDef CallEntries[] = {
{"XGBoosterSetParam_R", (DL_FUNC) &XGBoosterSetParam_R, 3},
{"XGBoosterUpdateOneIter_R", (DL_FUNC) &XGBoosterUpdateOneIter_R, 3},
{"XGCheckNullPtr_R", (DL_FUNC) &XGCheckNullPtr_R, 1},
{"XGDMatrixCreateFromCSC_R", (DL_FUNC) &XGDMatrixCreateFromCSC_R, 5},
{"XGDMatrixCreateFromCSR_R", (DL_FUNC) &XGDMatrixCreateFromCSR_R, 5},
{"XGDMatrixCreateFromCSC_R", (DL_FUNC) &XGDMatrixCreateFromCSC_R, 4},
{"XGDMatrixCreateFromFile_R", (DL_FUNC) &XGDMatrixCreateFromFile_R, 2},
{"XGDMatrixCreateFromMat_R", (DL_FUNC) &XGDMatrixCreateFromMat_R, 3},
{"XGDMatrixGetInfo_R", (DL_FUNC) &XGDMatrixGetInfo_R, 2},

View File

@@ -1,23 +1,16 @@
/**
* Copyright 2014-2022 by XGBoost Contributors
*/
#include <dmlc/common.h>
// Copyright (c) 2014 by Contributors
#include <dmlc/logging.h>
#include <dmlc/omp.h>
#include <dmlc/common.h>
#include <xgboost/c_api.h>
#include <xgboost/data.h>
#include <xgboost/generic_parameters.h>
#include <xgboost/logging.h>
#include <cstdio>
#include <cstring>
#include <sstream>
#include <vector>
#include <string>
#include <utility>
#include <vector>
#include <cstring>
#include <cstdio>
#include <sstream>
#include "../../src/c_api/c_api_error.h"
#include "../../src/common/threading_utils.h"
#include "./xgboost_R.h"
/*!
@@ -44,21 +37,8 @@
error(XGBGetLastError()); \
}
using dmlc::BeginPtr;
xgboost::GenericParameter const *BoosterCtx(BoosterHandle handle) {
CHECK_HANDLE();
auto *learner = static_cast<xgboost::Learner *>(handle);
CHECK(learner);
return learner->Ctx();
}
xgboost::GenericParameter const *DMatrixCtx(DMatrixHandle handle) {
CHECK_HANDLE();
auto p_m = static_cast<std::shared_ptr<xgboost::DMatrix> *>(handle);
CHECK(p_m);
return p_m->get()->Ctx();
}
using namespace dmlc;
XGB_DLL SEXP XGCheckNullPtr_R(SEXP handle) {
return ScalarLogical(R_ExternalPtrAddr(handle) == NULL);
@@ -114,13 +94,18 @@ XGB_DLL SEXP XGDMatrixCreateFromMat_R(SEXP mat, SEXP missing, SEXP n_threads) {
din = REAL(mat);
}
std::vector<float> data(nrow * ncol);
dmlc::OMPException exc;
int32_t threads = xgboost::common::OmpGetNumThreads(asInteger(n_threads));
xgboost::common::ParallelFor(nrow, threads, [&](xgboost::omp_ulong i) {
for (size_t j = 0; j < ncol; ++j) {
data[i * ncol + j] = is_int ? static_cast<float>(iin[i + nrow * j]) : din[i + nrow * j];
}
});
#pragma omp parallel for schedule(static) num_threads(threads)
for (omp_ulong i = 0; i < nrow; ++i) {
exc.Run([&]() {
for (size_t j = 0; j < ncol; ++j) {
data[i * ncol +j] = is_int ? static_cast<float>(iin[i + nrow * j]) : din[i + nrow * j];
}
});
}
exc.Rethrow();
DMatrixHandle handle;
CHECK_CALL(XGDMatrixCreateFromMat_omp(BeginPtr(data), nrow, ncol,
asReal(missing), &handle, threads));
@@ -132,7 +117,7 @@ XGB_DLL SEXP XGDMatrixCreateFromMat_R(SEXP mat, SEXP missing, SEXP n_threads) {
}
XGB_DLL SEXP XGDMatrixCreateFromCSC_R(SEXP indptr, SEXP indices, SEXP data,
SEXP num_row, SEXP n_threads) {
SEXP num_row) {
SEXP ret;
R_API_BEGIN();
const int *p_indptr = INTEGER(indptr);
@@ -148,11 +133,15 @@ XGB_DLL SEXP XGDMatrixCreateFromCSC_R(SEXP indptr, SEXP indices, SEXP data,
for (size_t i = 0; i < nindptr; ++i) {
col_ptr_[i] = static_cast<size_t>(p_indptr[i]);
}
int32_t threads = xgboost::common::OmpGetNumThreads(asInteger(n_threads));
xgboost::common::ParallelFor(ndata, threads, [&](xgboost::omp_ulong i) {
indices_[i] = static_cast<unsigned>(p_indices[i]);
data_[i] = static_cast<float>(p_data[i]);
});
dmlc::OMPException exc;
#pragma omp parallel for schedule(static)
for (int64_t i = 0; i < static_cast<int64_t>(ndata); ++i) {
exc.Run([&]() {
indices_[i] = static_cast<unsigned>(p_indices[i]);
data_[i] = static_cast<float>(p_data[i]);
});
}
exc.Rethrow();
DMatrixHandle handle;
CHECK_CALL(XGDMatrixCreateFromCSCEx(BeginPtr(col_ptr_), BeginPtr(indices_),
BeginPtr(data_), nindptr, ndata,
@@ -164,39 +153,6 @@ XGB_DLL SEXP XGDMatrixCreateFromCSC_R(SEXP indptr, SEXP indices, SEXP data,
return ret;
}
XGB_DLL SEXP XGDMatrixCreateFromCSR_R(SEXP indptr, SEXP indices, SEXP data,
SEXP num_col, SEXP n_threads) {
SEXP ret;
R_API_BEGIN();
const int *p_indptr = INTEGER(indptr);
const int *p_indices = INTEGER(indices);
const double *p_data = REAL(data);
size_t nindptr = static_cast<size_t>(length(indptr));
size_t ndata = static_cast<size_t>(length(data));
size_t ncol = static_cast<size_t>(INTEGER(num_col)[0]);
std::vector<size_t> row_ptr_(nindptr);
std::vector<unsigned> indices_(ndata);
std::vector<float> data_(ndata);
for (size_t i = 0; i < nindptr; ++i) {
row_ptr_[i] = static_cast<size_t>(p_indptr[i]);
}
int32_t threads = xgboost::common::OmpGetNumThreads(asInteger(n_threads));
xgboost::common::ParallelFor(ndata, threads, [&](xgboost::omp_ulong i) {
indices_[i] = static_cast<unsigned>(p_indices[i]);
data_[i] = static_cast<float>(p_data[i]);
});
DMatrixHandle handle;
CHECK_CALL(XGDMatrixCreateFromCSREx(BeginPtr(row_ptr_), BeginPtr(indices_),
BeginPtr(data_), nindptr, ndata,
ncol, &handle));
ret = PROTECT(R_MakeExternalPtr(handle, R_NilValue, R_NilValue));
R_RegisterCFinalizerEx(ret, _DMatrixFinalizer, TRUE);
R_API_END();
UNPROTECT(1);
return ret;
}
XGB_DLL SEXP XGDMatrixSliceDMatrix_R(SEXP handle, SEXP idxset) {
SEXP ret;
R_API_BEGIN();
@@ -230,20 +186,31 @@ XGB_DLL SEXP XGDMatrixSetInfo_R(SEXP handle, SEXP field, SEXP array) {
R_API_BEGIN();
int len = length(array);
const char *name = CHAR(asChar(field));
auto ctx = DMatrixCtx(R_ExternalPtrAddr(handle));
dmlc::OMPException exc;
if (!strcmp("group", name)) {
std::vector<unsigned> vec(len);
xgboost::common::ParallelFor(len, ctx->Threads(), [&](xgboost::omp_ulong i) {
vec[i] = static_cast<unsigned>(INTEGER(array)[i]);
});
CHECK_CALL(
XGDMatrixSetUIntInfo(R_ExternalPtrAddr(handle), CHAR(asChar(field)), BeginPtr(vec), len));
#pragma omp parallel for schedule(static)
for (int i = 0; i < len; ++i) {
exc.Run([&]() {
vec[i] = static_cast<unsigned>(INTEGER(array)[i]);
});
}
exc.Rethrow();
CHECK_CALL(XGDMatrixSetUIntInfo(R_ExternalPtrAddr(handle),
CHAR(asChar(field)),
BeginPtr(vec), len));
} else {
std::vector<float> vec(len);
xgboost::common::ParallelFor(len, ctx->Threads(),
[&](xgboost::omp_ulong i) { vec[i] = REAL(array)[i]; });
CHECK_CALL(
XGDMatrixSetFloatInfo(R_ExternalPtrAddr(handle), CHAR(asChar(field)), BeginPtr(vec), len));
#pragma omp parallel for schedule(static)
for (int i = 0; i < len; ++i) {
exc.Run([&]() {
vec[i] = REAL(array)[i];
});
}
exc.Rethrow();
CHECK_CALL(XGDMatrixSetFloatInfo(R_ExternalPtrAddr(handle),
CHAR(asChar(field)),
BeginPtr(vec), len));
}
R_API_END();
return R_NilValue;
@@ -346,11 +313,15 @@ XGB_DLL SEXP XGBoosterBoostOneIter_R(SEXP handle, SEXP dtrain, SEXP grad, SEXP h
<< "gradient and hess must have same length";
int len = length(grad);
std::vector<float> tgrad(len), thess(len);
auto ctx = BoosterCtx(R_ExternalPtrAddr(handle));
xgboost::common::ParallelFor(len, ctx->Threads(), [&](xgboost::omp_ulong j) {
tgrad[j] = REAL(grad)[j];
thess[j] = REAL(hess)[j];
});
dmlc::OMPException exc;
#pragma omp parallel for schedule(static)
for (int j = 0; j < len; ++j) {
exc.Run([&]() {
tgrad[j] = REAL(grad)[j];
thess[j] = REAL(hess)[j];
});
}
exc.Rethrow();
CHECK_CALL(XGBoosterBoostOneIter(R_ExternalPtrAddr(handle),
R_ExternalPtrAddr(dtrain),
BeginPtr(tgrad), BeginPtr(thess),
@@ -427,10 +398,11 @@ XGB_DLL SEXP XGBoosterPredictFromDMatrix_R(SEXP handle, SEXP dmat, SEXP json_con
len *= out_shape[i];
}
r_out_result = PROTECT(allocVector(REALSXP, len));
auto ctx = BoosterCtx(R_ExternalPtrAddr(handle));
xgboost::common::ParallelFor(len, ctx->Threads(), [&](xgboost::omp_ulong i) {
#pragma omp parallel for
for (omp_ulong i = 0; i < len; ++i) {
REAL(r_out_result)[i] = out_result[i];
});
}
r_out = PROTECT(allocVector(VECSXP, 2));
@@ -457,6 +429,21 @@ XGB_DLL SEXP XGBoosterSaveModel_R(SEXP handle, SEXP fname) {
return R_NilValue;
}
XGB_DLL SEXP XGBoosterModelToRaw_R(SEXP handle) {
SEXP ret;
R_API_BEGIN();
bst_ulong olen;
const char *raw;
CHECK_CALL(XGBoosterGetModelRaw(R_ExternalPtrAddr(handle), &olen, &raw));
ret = PROTECT(allocVector(RAWSXP, olen));
if (olen != 0) {
memcpy(RAW(ret), raw, olen);
}
R_API_END();
UNPROTECT(1);
return ret;
}
XGB_DLL SEXP XGBoosterLoadModelFromRaw_R(SEXP handle, SEXP raw) {
R_API_BEGIN();
CHECK_CALL(XGBoosterLoadModelFromBuffer(R_ExternalPtrAddr(handle),
@@ -466,22 +453,6 @@ XGB_DLL SEXP XGBoosterLoadModelFromRaw_R(SEXP handle, SEXP raw) {
return R_NilValue;
}
XGB_DLL SEXP XGBoosterSaveModelToRaw_R(SEXP handle, SEXP json_config) {
SEXP ret;
R_API_BEGIN();
bst_ulong olen;
char const *c_json_config = CHAR(asChar(json_config));
char const *raw;
CHECK_CALL(XGBoosterSaveModelToBuffer(R_ExternalPtrAddr(handle), c_json_config, &olen, &raw))
ret = PROTECT(allocVector(RAWSXP, olen));
if (olen != 0) {
std::memcpy(RAW(ret), raw, olen);
}
R_API_END();
UNPROTECT(1);
return ret;
}
XGB_DLL SEXP XGBoosterSaveJsonConfig_R(SEXP handle) {
const char* ret;
R_API_BEGIN();
@@ -628,6 +599,7 @@ XGB_DLL SEXP XGBoosterFeatureScore_R(SEXP handle, SEXP json_config) {
CHECK_CALL(XGBoosterFeatureScore(R_ExternalPtrAddr(handle), c_json_config,
&out_n_features, &out_features,
&out_dim, &out_shape, &out_scores));
out_shape_sexp = PROTECT(allocVector(INTSXP, out_dim));
size_t len = 1;
for (size_t i = 0; i < out_dim; ++i) {
@@ -636,10 +608,10 @@ XGB_DLL SEXP XGBoosterFeatureScore_R(SEXP handle, SEXP json_config) {
}
out_scores_sexp = PROTECT(allocVector(REALSXP, len));
auto ctx = BoosterCtx(R_ExternalPtrAddr(handle));
xgboost::common::ParallelFor(len, ctx->Threads(), [&](xgboost::omp_ulong i) {
#pragma omp parallel for
for (omp_ulong i = 0; i < len; ++i) {
REAL(out_scores_sexp)[i] = out_scores[i];
});
}
out_features_sexp = PROTECT(allocVector(STRSXP, out_n_features));
for (size_t i = 0; i < out_n_features; ++i) {

View File

@@ -1,5 +1,5 @@
/*!
* Copyright 2014-2022 by XGBoost Contributors
* Copyright 2014 (c) by Contributors
* \file xgboost_R.h
* \author Tianqi Chen
* \brief R wrapper of xgboost
@@ -59,23 +59,12 @@ XGB_DLL SEXP XGDMatrixCreateFromMat_R(SEXP mat,
* \param indices row indices
* \param data content of the data
* \param num_row numer of rows (when it's set to 0, then guess from data)
* \param n_threads Number of threads used to construct DMatrix from csc matrix.
* \return created dmatrix
*/
XGB_DLL SEXP XGDMatrixCreateFromCSC_R(SEXP indptr, SEXP indices, SEXP data, SEXP num_row,
SEXP n_threads);
/*!
* \brief create a matrix content from CSR format
* \param indptr pointer to row headers
* \param indices column indices
* \param data content of the data
* \param num_col numer of columns (when it's set to 0, then guess from data)
* \param n_threads Number of threads used to construct DMatrix from csr matrix.
* \return created dmatrix
*/
XGB_DLL SEXP XGDMatrixCreateFromCSR_R(SEXP indptr, SEXP indices, SEXP data, SEXP num_col,
SEXP n_threads);
XGB_DLL SEXP XGDMatrixCreateFromCSC_R(SEXP indptr,
SEXP indices,
SEXP data,
SEXP num_row);
/*!
* \brief create a new dmatrix from sliced content of existing matrix
@@ -220,21 +209,11 @@ XGB_DLL SEXP XGBoosterSaveModel_R(SEXP handle, SEXP fname);
XGB_DLL SEXP XGBoosterLoadModelFromRaw_R(SEXP handle, SEXP raw);
/*!
* \brief Save model into R's raw array
*
* \brief save model into R's raw array
* \param handle handle
* \param json_config JSON encoded string storing parameters for the function. Following
* keys are expected in the JSON document:
*
* "format": str
* - json: Output booster will be encoded as JSON.
* - ubj: Output booster will be encoded as Univeral binary JSON.
* - deprecated: Output booster will be encoded as old custom binary format. Do now use
* this format except for compatibility reasons.
*
* \return Raw array
* \return raw array
*/
XGB_DLL SEXP XGBoosterSaveModelToRaw_R(SEXP handle, SEXP json_config);
XGB_DLL SEXP XGBoosterModelToRaw_R(SEXP handle);
/*!
* \brief Save internal parameters as a JSON string

View File

@@ -1,5 +1,4 @@
require(xgboost)
library(Matrix)
context("basic functions")
@@ -460,18 +459,3 @@ test_that("strict_shape works", {
test_iris()
test_agaricus()
})
test_that("'predict' accepts CSR data", {
X <- agaricus.train$data
y <- agaricus.train$label
x_csc <- as(X[1L, , drop = FALSE], "CsparseMatrix")
x_csr <- as(x_csc, "RsparseMatrix")
x_spv <- as(x_csc, "sparseVector")
bst <- xgboost(data = X, label = y, objective = "binary:logistic",
nrounds = 5L, verbose = FALSE)
p_csc <- predict(bst, x_csc)
p_csr <- predict(bst, x_csr)
p_spv <- predict(bst, x_spv)
expect_equal(p_csc, p_csr)
expect_equal(p_csc, p_spv)
})

View File

@@ -27,7 +27,6 @@ test_that("xgb.DMatrix: saving, loading", {
# save to a local file
dtest1 <- xgb.DMatrix(test_data, label = test_label)
tmp_file <- tempfile('xgb.DMatrix_')
on.exit(unlink(tmp_file))
expect_true(xgb.DMatrix.save(dtest1, tmp_file))
# read from a local file
expect_output(dtest3 <- xgb.DMatrix(tmp_file), "entries loaded from")
@@ -42,6 +41,7 @@ test_that("xgb.DMatrix: saving, loading", {
dtest4 <- xgb.DMatrix(tmp_file, silent = TRUE)
expect_equal(dim(dtest4), c(3, 4))
expect_equal(getinfo(dtest4, 'label'), c(0, 1, 0))
unlink(tmp_file)
})
test_that("xgb.DMatrix: getinfo & setinfo", {

View File

@@ -1,27 +0,0 @@
library(xgboost)
context("feature weights")
test_that("training with feature weights works", {
nrows <- 1000
ncols <- 9
set.seed(2022)
x <- matrix(rnorm(nrows * ncols), nrow = nrows)
y <- rowSums(x)
weights <- seq(from = 1, to = ncols)
test <- function(tm) {
names <- paste0("f", 1:ncols)
xy <- xgb.DMatrix(data = x, label = y, feature_weights = weights)
params <- list(colsample_bynode = 0.4, tree_method = tm, nthread = 1)
model <- xgb.train(params = params, data = xy, nrounds = 32)
importance <- xgb.importance(model = model, feature_names = names)
expect_equal(dim(importance), c(ncols, 4))
importance <- importance[order(importance$Feature)]
expect_lt(importance[1, Frequency], importance[9, Frequency])
}
for (tm in c("hist", "approx", "exact")) {
test(tm)
}
})

View File

@@ -46,31 +46,3 @@ test_that("gblinear works", {
expect_equal(dim(h), c(n, ncol(dtrain) + 1))
expect_s4_class(h, "dgCMatrix")
})
test_that("gblinear early stopping works", {
data(agaricus.train, package = 'xgboost')
data(agaricus.test, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
param <- list(
objective = "binary:logistic", eval_metric = "error", booster = "gblinear",
nthread = 2, eta = 0.8, alpha = 0.0001, lambda = 0.0001,
updater = "coord_descent"
)
es_round <- 1
n <- 10
booster <- xgb.train(
param, dtrain, n, list(eval = dtest, train = dtrain), early_stopping_rounds = es_round
)
expect_equal(booster$best_iteration, 5)
predt_es <- predict(booster, dtrain)
n <- booster$best_iteration + es_round
booster <- xgb.train(
param, dtrain, n, list(eval = dtest, train = dtrain), early_stopping_rounds = es_round
)
predt <- predict(booster, dtrain)
expect_equal(predt_es, predt)
})

View File

@@ -340,16 +340,6 @@ test_that("xgb.importance works with and without feature names", {
imp
}
expect_equal(importance_from_dump(), importance, tolerance = 1e-6)
## decision stump
m <- xgboost::xgboost(
data = as.matrix(data.frame(x = c(0, 1))),
label = c(1, 2),
nrounds = 1
)
df <- xgb.model.dt.tree(model = m)
expect_equal(df$Feature, "Leaf")
expect_equal(df$Cover, 2)
})
test_that("xgb.importance works with GLM model", {

View File

@@ -157,28 +157,3 @@ test_that("multiclass feature interactions work", {
# sums WRT columns must be close to feature contributions
expect_lt(max(abs(apply(intr, c(1, 2, 3), sum) - aperm(cont, c(3, 1, 2)))), 0.00001)
})
test_that("SHAP single sample works", {
train <- agaricus.train
test <- agaricus.test
booster <- xgboost(
data = train$data,
label = train$label,
max_depth = 2,
nrounds = 4,
objective = "binary:logistic",
)
predt <- predict(
booster,
newdata = train$data[1, , drop = FALSE], predcontrib = TRUE
)
expect_equal(dim(predt), c(1, dim(train$data)[2] + 1))
predt <- predict(
booster,
newdata = train$data[1, , drop = FALSE], predinteraction = TRUE
)
expect_equal(dim(predt), c(1, dim(train$data)[2] + 1, dim(train$data)[2] + 1))
})

View File

@@ -1,30 +0,0 @@
context("Test model IO.")
## some other tests are in test_basic.R
require(xgboost)
require(testthat)
data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")
train <- agaricus.train
test <- agaricus.test
test_that("load/save raw works", {
nrounds <- 8
booster <- xgboost(
data = train$data, label = train$label,
nrounds = nrounds, objective = "binary:logistic"
)
json_bytes <- xgb.save.raw(booster, raw_format = "json")
ubj_bytes <- xgb.save.raw(booster, raw_format = "ubj")
old_bytes <- xgb.save.raw(booster, raw_format = "deprecated")
from_json <- xgb.load.raw(json_bytes, as_booster = TRUE)
from_ubj <- xgb.load.raw(ubj_bytes, as_booster = TRUE)
json2old <- xgb.save.raw(from_json, raw_format = "deprecated")
ubj2old <- xgb.save.raw(from_ubj, raw_format = "deprecated")
expect_equal(json2old, ubj2old)
expect_equal(json2old, old_bytes)
})

View File

@@ -138,7 +138,7 @@ levels(df[,Treatment])
Next step, we will transform the categorical data to dummy variables.
Several encoding methods exist, e.g., [one-hot encoding](https://en.wikipedia.org/wiki/One-hot) is a common approach.
We will use the [dummy contrast coding](https://stats.oarc.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/) which is popular because it produces "full rank" encoding (also see [this blog post by Max Kuhn](http://appliedpredictivemodeling.com/blog/2013/10/23/the-basics-of-encoding-categorical-data-for-predictive-models)).
We will use the [dummy contrast coding](https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/) which is popular because it produces "full rank" encoding (also see [this blog post by Max Kuhn](http://appliedpredictivemodeling.com/blog/2013/10/23/the-basics-of-encoding-categorical-data-for-predictive-models)).
The purpose is to transform each value of each *categorical* feature into a *binary* feature `{0, 1}`.

View File

@@ -33,8 +33,8 @@
#include "../src/gbm/gblinear_model.cc"
// data
#include "../src/data/simple_dmatrix.cc"
#include "../src/data/data.cc"
#include "../src/data/simple_dmatrix.cc"
#include "../src/data/sparse_page_raw_format.cc"
#include "../src/data/ellpack_page.cc"
#include "../src/data/gradient_index.cc"
@@ -48,18 +48,16 @@
#include "../src/predictor/cpu_predictor.cc"
// trees
#include "../src/tree/constraints.cc"
#include "../src/tree/hist/param.cc"
#include "../src/tree/param.cc"
#include "../src/tree/tree_model.cc"
#include "../src/tree/tree_updater.cc"
#include "../src/tree/updater_approx.cc"
#include "../src/tree/updater_colmaker.cc"
#include "../src/tree/updater_histmaker.cc"
#include "../src/tree/updater_prune.cc"
#include "../src/tree/updater_quantile_hist.cc"
#include "../src/tree/updater_prune.cc"
#include "../src/tree/updater_refresh.cc"
#include "../src/tree/updater_sync.cc"
#include "../src/tree/updater_histmaker.cc"
#include "../src/tree/constraints.cc"
// linear
#include "../src/linear/linear_updater.cc"
@@ -77,11 +75,9 @@
#include "../src/common/quantile.cc"
#include "../src/common/host_device_vector.cc"
#include "../src/common/hist_util.cc"
#include "../src/common/io.cc"
#include "../src/common/json.cc"
#include "../src/common/pseudo_huber.cc"
#include "../src/common/io.cc"
#include "../src/common/survival_util.cc"
#include "../src/common/threading_utils.cc"
#include "../src/common/version.cc"
// c_api

View File

@@ -1 +1 @@
@xgboost_VERSION_MAJOR@.@xgboost_VERSION_MINOR@.@xgboost_VERSION_PATCH@rc1
@xgboost_VERSION_MAJOR@.@xgboost_VERSION_MINOR@.@xgboost_VERSION_PATCH@

View File

@@ -15,7 +15,7 @@ endfunction(auto_source_group)
# Force static runtime for MSVC
function(msvc_use_static_runtime)
if(MSVC AND (NOT BUILD_SHARED_LIBS) AND (NOT FORCE_SHARED_CRT))
if(MSVC)
set(variables
CMAKE_C_FLAGS_DEBUG
CMAKE_C_FLAGS_MINSIZEREL
@@ -105,10 +105,9 @@ function(format_gencode_flags flags out)
if (CMAKE_VERSION VERSION_GREATER_EQUAL "3.18")
cmake_policy(SET CMP0104 NEW)
list(POP_BACK flags latest_arch)
list(TRANSFORM flags APPEND "-real")
list(APPEND flags ${latest_arch})
set(CMAKE_CUDA_ARCHITECTURES ${flags})
foreach(ver ${flags})
set(CMAKE_CUDA_ARCHITECTURES "${ver}-real;${ver}-virtual;${CMAKE_CUDA_ARCHITECTURES}")
endforeach()
set(CMAKE_CUDA_ARCHITECTURES "${CMAKE_CUDA_ARCHITECTURES}" PARENT_SCOPE)
message(STATUS "CMAKE_CUDA_ARCHITECTURES: ${CMAKE_CUDA_ARCHITECTURES}")
else()
@@ -137,8 +136,7 @@ function(xgboost_set_cuda_flags target)
$<$<COMPILE_LANGUAGE:CUDA>:--expt-extended-lambda>
$<$<COMPILE_LANGUAGE:CUDA>:--expt-relaxed-constexpr>
$<$<COMPILE_LANGUAGE:CUDA>:${GEN_CODE}>
$<$<COMPILE_LANGUAGE:CUDA>:-Xcompiler=${OpenMP_CXX_FLAGS}>
$<$<COMPILE_LANGUAGE:CUDA>:-Xfatbin=-compress-all>)
$<$<COMPILE_LANGUAGE:CUDA>:-Xcompiler=${OpenMP_CXX_FLAGS}>)
if (CMAKE_VERSION VERSION_GREATER_EQUAL "3.18")
set_property(TARGET ${target} PROPERTY CUDA_ARCHITECTURES ${CMAKE_CUDA_ARCHITECTURES})

View File

@@ -1,4 +1,4 @@
#!/usr/bin/env python3
#!/usr/bin/python
def loadfmap( fname ):
fmap = {}

View File

@@ -1,5 +1,4 @@
#!/usr/bin/env python3
#!/usr/bin/python
import sys
import random
@@ -27,3 +26,4 @@ for l in fi:
fi.close()
ftr.close()
fte.close()

View File

@@ -1,4 +1,4 @@
#!/usr/bin/env python3
#!/usr/bin/python
fo = open('machine.txt', 'w')
cnt = 6

View File

@@ -1,5 +1,4 @@
#!/usr/bin/env python3
#!/usr/bin/python
import sys
import random

View File

@@ -1,5 +1,3 @@
#!/usr/bin/env python3
import sys
fo = open(sys.argv[2], 'w')

View File

@@ -136,7 +136,7 @@ Send a PR to add a one sentence description:)
- XGBoost is used in [Kaggle Script](https://www.kaggle.com/scripts) to solve data science challenges.
- Distribute XGBoost as Rest API server from Jupyter notebook with [BentoML](https://github.com/bentoml/bentoml). [Link to notebook](https://github.com/bentoml/BentoML/blob/master/examples/xgboost-predict-titanic-survival/XGBoost-titanic-survival-prediction.ipynb)
- [Seldon predictive service powered by XGBoost](https://docs.seldon.io/projects/seldon-core/en/latest/servers/xgboost.html)
- [Seldon predictive service powered by XGBoost](http://docs.seldon.io/iris-demo.html)
- XGBoost Distributed is used in [ODPS Cloud Service by Alibaba](https://yq.aliyun.com/articles/6355) (in Chinese)
- XGBoost is incoporated as part of [Graphlab Create](https://dato.com/products/create/) for scalable machine learning.
- [Hanjing Su](https://www.52cs.org) from Tencent data platform team: "We use distributed XGBoost for click through prediction in wechat shopping and lookalikes. The problems involve hundreds millions of users and thousands of features. XGBoost is cleanly designed and can be easily integrated into our production environment, reducing our cost in developments."
@@ -146,7 +146,7 @@ Send a PR to add a one sentence description:)
- [BayesBoost](https://github.com/mpearmain/BayesBoost) - Bayesian Optimization using xgboost and sklearn API
- [FLAML](https://github.com/microsoft/FLAML) - An open source AutoML library
designed to automatically produce accurate machine learning models with low computational cost. FLAML includes [XGBoost as one of the default learners](https://github.com/microsoft/FLAML/blob/main/flaml/model.py) and can also be used as a fast hyperparameter tuning tool for XGBoost ([code example](https://microsoft.github.io/FLAML/docs/Examples/AutoML-for-XGBoost)).
designed to automatically produce accurate machine learning models with low computational cost. FLAML includes [XGBoost as one of the default learners](https://github.com/microsoft/FLAML/blob/main/flaml/model.py) and can also be used as a fast hyperparameter tuning tool for XGBoost ([code example](https://github.com/microsoft/FLAML/blob/main/notebook/flaml_xgboost.ipynb)).
- [gp_xgboost_gridsearch](https://github.com/vatsan/gp_xgboost_gridsearch) - In-database parallel grid-search for XGBoost on [Greenplum](https://github.com/greenplum-db/gpdb) using PL/Python
- [tpot](https://github.com/rhiever/tpot) - A Python tool that automatically creates and optimizes machine learning pipelines using genetic programming.

6
demo/dask/README.md Normal file
View File

@@ -0,0 +1,6 @@
Dask
====
This directory contains some demonstrations for using `dask` with `XGBoost`.
For an overview, see
https://xgboost.readthedocs.io/en/latest/tutorials/dask.html .

View File

@@ -1,5 +0,0 @@
XGBoost Dask Feature Walkthrough
================================
This directory contains some demonstrations for using `dask` with `XGBoost`. For an
overview, see :doc:`/tutorials/dask`

View File

@@ -1,7 +1,4 @@
"""
Example of using callbacks with Dask
====================================
"""
"""Example of using callbacks in Dask"""
import numpy as np
import xgboost as xgb
from xgboost.dask import DaskDMatrix

View File

@@ -1,9 +1,3 @@
"""
Example of training survival model with Dask on CPU
===================================================
"""
import xgboost as xgb
import os
from xgboost.dask import DaskDMatrix

View File

@@ -1,8 +1,3 @@
"""
Example of training with Dask on CPU
====================================
"""
import xgboost as xgb
from xgboost.dask import DaskDMatrix
from dask.distributed import Client

View File

@@ -1,7 +1,3 @@
"""
Example of training with Dask on GPU
====================================
"""
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from dask import array as da

View File

@@ -1,7 +1,6 @@
"""
Use scikit-learn regressor interface with CPU histogram tree method
===================================================================
"""
'''Dask interface demo:
Use scikit-learn regressor interface with CPU histogram tree method.'''
from dask.distributed import Client
from dask.distributed import LocalCluster
from dask import array as da
@@ -17,7 +16,7 @@ def main(client):
y = da.random.random(m, partition_size)
regressor = xgboost.dask.DaskXGBRegressor(verbosity=1, n_estimators=2)
regressor.set_params(tree_method="hist")
regressor.set_params(tree_method='hist')
# assigning client here is optional
regressor.client = client
@@ -27,13 +26,13 @@ def main(client):
bst = regressor.get_booster()
history = regressor.evals_result()
print("Evaluation history:", history)
print('Evaluation history:', history)
# returned prediction is always a dask array.
assert isinstance(prediction, da.Array)
return bst # returning the trained model
return bst # returning the trained model
if __name__ == "__main__":
if __name__ == '__main__':
# or use other clusters for scaling
with LocalCluster(n_workers=4, threads_per_worker=1) as cluster:
with Client(cluster) as client:

View File

@@ -1,7 +1,6 @@
"""
Use scikit-learn regressor interface with GPU histogram tree method
===================================================================
"""
'''Dask interface demo:
Use scikit-learn regressor interface with GPU histogram tree method.'''
from dask.distributed import Client
# It's recommended to use dask_cuda for GPU assignment

View File

@@ -0,0 +1,18 @@
XGBoost Python Feature Walkthrough
==================================
* [Basic walkthrough of wrappers](basic_walkthrough.py)
* [Customize loss function, and evaluation metric](custom_objective.py)
* [Re-implement RMSLE as customized metric and objective](custom_rmsle.py)
* [Re-Implement `multi:softmax` objective as customized objective](custom_softmax.py)
* [Boosting from existing prediction](boost_from_prediction.py)
* [Predicting using first n trees](predict_first_ntree.py)
* [Generalized Linear Model](generalized_linear_model.py)
* [Cross validation](cross_validation.py)
* [Predicting leaf indices](predict_leaf_indices.py)
* [Sklearn Wrapper](sklearn_examples.py)
* [Sklearn Parallel](sklearn_parallel.py)
* [Sklearn access evals result](sklearn_evals_result.py)
* [Access evals result](evals_result.py)
* [External Memory](external_memory.py)
* [Training continuation](continuation.py)
* [Feature weights for column sampling](feature_weights.py)

View File

@@ -1,5 +0,0 @@
XGBoost Python Feature Walkthrough
==================================
This is a collection of examples for using the XGBoost Python package.

View File

@@ -1,7 +1,3 @@
"""
Getting started with XGBoost
============================
"""
import numpy as np
import scipy.sparse
import pickle

View File

@@ -1,7 +1,3 @@
"""
Demo for boosting from prediction
=================================
"""
import os
import xgboost as xgb

View File

@@ -1,6 +1,5 @@
'''
Demo for using and defining callback functions
==============================================
Demo for using and defining callback functions.
.. versionadded:: 1.3.0
'''

View File

@@ -1,124 +0,0 @@
"""
Train XGBoost with cat_in_the_dat dataset
=========================================
A simple demo for categorical data support using dataset from Kaggle categorical data
tutorial.
The excellent tutorial is at:
https://www.kaggle.com/shahules/an-overview-of-encoding-techniques
And the data can be found at:
https://www.kaggle.com/shahules/an-overview-of-encoding-techniques/data
Also, see the tutorial for using XGBoost with categorical data:
:doc:`/tutorials/categorical`.
.. versionadded 1.6.0
"""
from __future__ import annotations
from time import time
import os
from tempfile import TemporaryDirectory
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import xgboost as xgb
def load_cat_in_the_dat() -> tuple[pd.DataFrame, pd.Series]:
"""Assuming you have already downloaded the data into `input` directory."""
df_train = pd.read_csv("./input/cat-in-the-dat/train.csv")
print(
"train data set has got {} rows and {} columns".format(
df_train.shape[0], df_train.shape[1]
)
)
X = df_train.drop(["target"], axis=1)
y = df_train["target"]
for i in range(0, 5):
X["bin_" + str(i)] = X["bin_" + str(i)].astype("category")
for i in range(0, 5):
X["nom_" + str(i)] = X["nom_" + str(i)].astype("category")
for i in range(5, 10):
X["nom_" + str(i)] = X["nom_" + str(i)].apply(int, base=16)
for i in range(0, 6):
X["ord_" + str(i)] = X["ord_" + str(i)].astype("category")
print(
"train data set has got {} rows and {} columns".format(X.shape[0], X.shape[1])
)
return X, y
params = {
"tree_method": "gpu_hist",
"use_label_encoder": False,
"n_estimators": 32,
"colsample_bylevel": 0.7,
}
def categorical_model(X: pd.DataFrame, y: pd.Series, output_dir: str) -> None:
"""Train using builtin categorical data support from XGBoost"""
X_train, X_test, y_train, y_test = train_test_split(
X, y, random_state=1994, test_size=0.2
)
# Specify `enable_categorical` to True.
clf = xgb.XGBClassifier(
**params,
eval_metric="auc",
enable_categorical=True,
max_cat_to_onehot=1, # We use optimal partitioning exclusively
)
clf.fit(X_train, y_train, eval_set=[(X_test, y_test), (X_train, y_train)])
clf.save_model(os.path.join(output_dir, "categorical.json"))
y_score = clf.predict_proba(X_test)[:, 1] # proba of positive samples
auc = roc_auc_score(y_test, y_score)
print("AUC of using builtin categorical data support:", auc)
def onehot_encoding_model(X: pd.DataFrame, y: pd.Series, output_dir: str) -> None:
"""Train using one-hot encoded data."""
X_train, X_test, y_train, y_test = train_test_split(
X, y, random_state=42, test_size=0.2
)
# Specify `enable_categorical` to False as we are using encoded data.
clf = xgb.XGBClassifier(**params, eval_metric="auc", enable_categorical=False)
clf.fit(
X_train,
y_train,
eval_set=[(X_test, y_test), (X_train, y_train)],
)
clf.save_model(os.path.join(output_dir, "one-hot.json"))
y_score = clf.predict_proba(X_test)[:, 1] # proba of positive samples
auc = roc_auc_score(y_test, y_score)
print("AUC of using onehot encoding:", auc)
if __name__ == "__main__":
X, y = load_cat_in_the_dat()
with TemporaryDirectory() as tmpdir:
start = time()
categorical_model(X, y, tmpdir)
end = time()
print("Duration:categorical", end - start)
X = pd.get_dummies(X)
start = time()
onehot_encoding_model(X, y, tmpdir)
end = time()
print("Duration:onehot", end - start)

View File

@@ -1,17 +1,9 @@
"""
Getting started with categorical data
=====================================
Experimental support for categorical data. After 1.5 XGBoost `gpu_hist` tree method has
experimental support for one-hot encoding based tree split, and in 1.6 `approx` support
was added.
"""Experimental support for categorical data. After 1.5 XGBoost `gpu_hist` tree method
has experimental support for one-hot encoding based tree split.
In before, users need to run an encoder themselves before passing the data into XGBoost,
which creates a sparse matrix and potentially increase memory usage. This demo
showcases the experimental categorical data support, more advanced features are planned.
Also, see :doc:`the tutorial </tutorials/categorical>` for using XGBoost with
categorical data.
which creates a sparse matrix and potentially increase memory usage. This demo showcases
the experimental categorical data support, more advanced features are planned.
.. versionadded:: 1.5.0
@@ -55,11 +47,8 @@ def main() -> None:
# For scikit-learn interface, the input data must be pandas DataFrame or cudf
# DataFrame with categorical features
X, y = make_categorical(100, 10, 4, False)
# Specify `enable_categorical` to True, also we use onehot encoding based split
# here for demonstration. For details see the document of `max_cat_to_onehot`.
reg = xgb.XGBRegressor(
tree_method="gpu_hist", enable_categorical=True, max_cat_to_onehot=5
)
# Specify `enable_categorical` to True.
reg = xgb.XGBRegressor(tree_method="gpu_hist", enable_categorical=True)
reg.fit(X, y, eval_set=[(X, y)])
# Pass in already encoded data

View File

@@ -1,6 +1,5 @@
"""
Demo for training continuation
==============================
Demo for training continuation.
"""
from sklearn.datasets import load_breast_cancer

View File

@@ -1,7 +1,3 @@
"""
Demo for using cross validation
===============================
"""
import os
import numpy as np
import xgboost as xgb

View File

@@ -0,0 +1,61 @@
###
# advanced: customized loss function
#
import os
import numpy as np
import xgboost as xgb
print('start running example to used customized objective function')
CURRENT_DIR = os.path.dirname(__file__)
dtrain = xgb.DMatrix(os.path.join(CURRENT_DIR, '../data/agaricus.txt.train'))
dtest = xgb.DMatrix(os.path.join(CURRENT_DIR, '../data/agaricus.txt.test'))
# note: what we are getting is margin value in prediction you must know what
# you are doing
param = {'max_depth': 2, 'eta': 1, 'objective': 'reg:logistic'}
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 10
# user define objective function, given prediction, return gradient and second
# order gradient this is log likelihood loss
def logregobj(preds, dtrain):
labels = dtrain.get_label()
preds = 1.0 / (1.0 + np.exp(-preds)) # transform raw leaf weight
grad = preds - labels
hess = preds * (1.0 - preds)
return grad, hess
# user defined evaluation function, return a pair metric_name, result
# NOTE: when you do customized loss function, the default prediction value is
# margin, which means the prediction is score before logistic transformation.
def evalerror(preds, dtrain):
labels = dtrain.get_label()
preds = 1.0 / (1.0 + np.exp(-preds)) # transform raw leaf weight
# return a pair metric_name, result. The metric name must not contain a
# colon (:) or a space
return 'my-error', float(sum(labels != (preds > 0.5))) / len(labels)
py_evals_result = {}
# training with customized objective, we can also do step by step training
# simply look at training.py's implementation of train
py_params = param.copy()
py_params.update({'disable_default_eval_metric': True})
py_logreg = xgb.train(py_params, dtrain, num_round, watchlist, obj=logregobj,
feval=evalerror, evals_result=py_evals_result)
evals_result = {}
params = param.copy()
params.update({'eval_metric': 'error'})
logreg = xgb.train(params, dtrain, num_boost_round=num_round, evals=watchlist,
evals_result=evals_result)
for i in range(len(py_evals_result['train']['my-error'])):
np.testing.assert_almost_equal(py_evals_result['train']['my-error'],
evals_result['train']['error'])

View File

@@ -1,19 +1,16 @@
"""
Demo for defining a custom regression objective and metric
==========================================================
'''Demo for defining customized metric and objective. Notice that for
simplicity reason weight is not used in following example. In this
script, we implement the Squared Log Error (SLE) objective and RMSLE metric as customized
functions, then compare it with native implementation in XGBoost.
Demo for defining customized metric and objective. Notice that for simplicity reason
weight is not used in following example. In this script, we implement the Squared Log
Error (SLE) objective and RMSLE metric as customized functions, then compare it with
native implementation in XGBoost.
See doc/tutorials/custom_metric_obj.rst for a step by step
walkthrough, with other details.
See :doc:`/tutorials/custom_metric_obj` for a step by step walkthrough, with other
details.
The `SLE` objective reduces impact of outliers in training dataset,
hence here we also compare its performance with standard squared
error.
The `SLE` objective reduces impact of outliers in training dataset, hence here we also
compare its performance with standard squared error.
"""
'''
import numpy as np
import xgboost as xgb
from typing import Tuple, Dict, List
@@ -147,7 +144,7 @@ def py_rmsle(dtrain: xgb.DMatrix, dtest: xgb.DMatrix) -> Dict:
dtrain=dtrain,
num_boost_round=kBoostRound,
obj=squared_log,
custom_metric=rmsle,
feval=rmsle,
evals=[(dtrain, 'dtrain'), (dtest, 'dtest')],
evals_result=results)
@@ -174,6 +171,9 @@ def plot_history(rmse_evals, rmsle_evals, py_rmsle_evals):
ax2.plot(x, py_rmsle_evals['dtest']['PyRMSLE'], label='test-PyRMSLE')
ax2.legend()
plt.show()
plt.close()
def main(args):
dtrain, dtest = generate_data()
@@ -183,10 +183,9 @@ def main(args):
if args.plot != 0:
plot_history(rmse_evals, rmsle_evals, py_rmsle_evals)
plt.show()
if __name__ == "__main__":
if __name__ == '__main__':
parser = argparse.ArgumentParser(
description='Arguments for custom RMSLE objective function demo.')
parser.add_argument(

View File

@@ -1,12 +1,7 @@
'''
Demo for creating customized multi-class objective function
===========================================================
This demo is only applicable after (excluding) XGBoost 1.0.0, as before this version
XGBoost returns transformed prediction for multi-class objective function. More details
in comments.
See :doc:`/tutorials/custom_metric_obj` for detailed tutorial and notes.
'''Demo for creating customized multi-class objective function. This demo is
only applicable after (excluding) XGBoost 1.0.0, as before this version XGBoost
returns transformed prediction for multi-class objective function. More
details in comments.
'''
@@ -100,12 +95,7 @@ def predict(booster: xgb.Booster, X):
def merror(predt: np.ndarray, dtrain: xgb.DMatrix):
y = dtrain.get_label()
# Like custom objective, the predt is untransformed leaf weight when custom objective
# is provided.
# With the use of `custom_metric` parameter in train function, custom metric receives
# raw input only when custom objective is also being used. Otherwise custom metric
# will receive transformed prediction.
# Like custom objective, the predt is untransformed leaf weight
assert predt.shape == (kRows, kClasses)
out = np.zeros(kRows)
for r in range(predt.shape[0]):
@@ -144,7 +134,7 @@ def main(args):
m,
num_boost_round=kRounds,
obj=softprob_obj,
custom_metric=merror,
feval=merror,
evals_result=custom_results,
evals=[(m, 'train')])
@@ -153,7 +143,6 @@ def main(args):
native_results = {}
# Use the same objective function defined in XGBoost.
booster_native = xgb.train({'num_class': kClasses,
"objective": "multi:softmax",
'eval_metric': 'merror'},
m,
num_boost_round=kRounds,

View File

@@ -1,7 +1,6 @@
"""
This script demonstrate how to access the eval metrics
======================================================
"""
##
# This script demonstrate how to access the eval metrics in xgboost
##
import os
import xgboost as xgb

View File

@@ -1,37 +1,30 @@
"""
Experimental support for external memory
========================================
This is similar to the one in `quantile_data_iterator.py`, but for external memory
instead of Quantile DMatrix. The feature is not ready for production use yet.
"""Experimental support for external memory. This is similar to the one in
`quantile_data_iterator.py`, but for external memory instead of Quantile DMatrix. The
feature is not ready for production use yet.
.. versionadded:: 1.5.0
See :doc:`the tutorial </tutorials/external_memory>` for more details.
"""
import os
import xgboost
from typing import Callable, List, Tuple
from sklearn.datasets import make_regression
import tempfile
import numpy as np
def make_batches(
n_samples_per_batch: int, n_features: int, n_batches: int, tmpdir: str,
) -> List[Tuple[str, str]]:
files: List[Tuple[str, str]] = []
n_samples_per_batch: int, n_features: int, n_batches: int
) -> Tuple[List[np.ndarray], List[np.ndarray]]:
"""Generate random batches."""
X = []
y = []
rng = np.random.RandomState(1994)
for i in range(n_batches):
X, y = make_regression(n_samples_per_batch, n_features, random_state=rng)
X_path = os.path.join(tmpdir, "X-" + str(i) + ".npy")
y_path = os.path.join(tmpdir, "y-" + str(i) + ".npy")
np.save(X_path, X)
np.save(y_path, y)
files.append((X_path, y_path))
return files
_X = rng.randn(n_samples_per_batch, n_features)
_y = rng.randn(n_samples_per_batch)
X.append(_X)
y.append(_y)
return X, y
class Iterator(xgboost.DataIter):
@@ -45,8 +38,8 @@ class Iterator(xgboost.DataIter):
def load_file(self) -> Tuple[np.ndarray, np.ndarray]:
X_path, y_path = self._file_paths[self._it]
X = np.load(X_path)
y = np.load(y_path)
X = np.loadtxt(X_path)
y = np.loadtxt(y_path)
assert X.shape[0] == y.shape[0]
return X, y
@@ -73,21 +66,24 @@ class Iterator(xgboost.DataIter):
def main(tmpdir: str) -> xgboost.Booster:
# generate some random data for demo
files = make_batches(1024, 17, 31, tmpdir)
batches = make_batches(1024, 17, 31)
files = []
for i, (X, y) in enumerate(zip(*batches)):
X_path = os.path.join(tmpdir, "X-" + str(i) + ".txt")
np.savetxt(X_path, X)
y_path = os.path.join(tmpdir, "y-" + str(i) + ".txt")
np.savetxt(y_path, y)
files.append((X_path, y_path))
it = Iterator(files)
# For non-data arguments, specify it here once instead of passing them by the `next`
# method.
missing = np.NaN
Xy = xgboost.DMatrix(it, missing=missing, enable_categorical=False)
# Other tree methods including ``hist`` and ``gpu_hist`` also work, see tutorial in
# doc for details.
booster = xgboost.train(
{"tree_method": "approx", "max_depth": 2},
Xy,
evals=[(Xy, "Train")],
num_boost_round=10,
)
# Other tree methods including ``hist`` and ``gpu_hist`` also work, but has some
# caveats. This is still an experimental feature.
booster = xgboost.train({"tree_method": "approx"}, Xy)
return booster

View File

@@ -1,6 +1,4 @@
'''
Demo for using feature weight to change column sampling
=======================================================
'''Using feature weight to change column sampling.
.. versionadded:: 1.3.0
'''
@@ -27,7 +25,7 @@ def main(args):
dtrain.set_info(feature_weights=fw)
bst = xgboost.train({'tree_method': 'hist',
'colsample_bynode': 0.2},
'colsample_bynode': 0.5},
dtrain, num_boost_round=10,
evals=[(dtrain, 'd')])
feature_map = bst.get_fscore()

View File

@@ -1,7 +1,3 @@
"""
Demo for gamma regression
=========================
"""
import xgboost as xgb
import numpy as np

View File

@@ -1,7 +1,3 @@
"""
Demo for GLM
============
"""
import os
import xgboost as xgb
##

View File

@@ -1,111 +0,0 @@
"""
A demo for multi-output regression
==================================
The demo is adopted from scikit-learn:
https://scikit-learn.org/stable/auto_examples/ensemble/plot_random_forest_regression_multioutput.html#sphx-glr-auto-examples-ensemble-plot-random-forest-regression-multioutput-py
See :doc:`/tutorials/multioutput` for more information.
"""
import argparse
from typing import Dict, Tuple, List
import numpy as np
from matplotlib import pyplot as plt
import xgboost as xgb
def plot_predt(y: np.ndarray, y_predt: np.ndarray, name: str) -> None:
s = 25
plt.scatter(y[:, 0], y[:, 1], c="navy", s=s, edgecolor="black", label="data")
plt.scatter(
y_predt[:, 0], y_predt[:, 1], c="cornflowerblue", s=s, edgecolor="black"
)
plt.xlim([-1, 2])
plt.ylim([-1, 2])
plt.show()
def gen_circle() -> Tuple[np.ndarray, np.ndarray]:
"Generate a sample dataset that y is a 2 dim circle."
rng = np.random.RandomState(1994)
X = np.sort(200 * rng.rand(100, 1) - 100, axis=0)
y = np.array([np.pi * np.sin(X).ravel(), np.pi * np.cos(X).ravel()]).T
y[::5, :] += 0.5 - rng.rand(20, 2)
y = y - y.min()
y = y / y.max()
return X, y
def rmse_model(plot_result: bool):
"""Draw a circle with 2-dim coordinate as target variables."""
X, y = gen_circle()
# Train a regressor on it
reg = xgb.XGBRegressor(tree_method="hist", n_estimators=64)
reg.fit(X, y, eval_set=[(X, y)])
y_predt = reg.predict(X)
if plot_result:
plot_predt(y, y_predt, "multi")
def custom_rmse_model(plot_result: bool) -> None:
"""Train using Python implementation of Squared Error."""
# As the experimental support status, custom objective doesn't support matrix as
# gradient and hessian, which will be changed in future release.
def gradient(predt: np.ndarray, dtrain: xgb.DMatrix) -> np.ndarray:
"""Compute the gradient squared error."""
y = dtrain.get_label().reshape(predt.shape)
return (predt - y).reshape(y.size)
def hessian(predt: np.ndarray, dtrain: xgb.DMatrix) -> np.ndarray:
"""Compute the hessian for squared error."""
return np.ones(predt.shape).reshape(predt.size)
def squared_log(
predt: np.ndarray, dtrain: xgb.DMatrix
) -> Tuple[np.ndarray, np.ndarray]:
grad = gradient(predt, dtrain)
hess = hessian(predt, dtrain)
return grad, hess
def rmse(predt: np.ndarray, dtrain: xgb.DMatrix) -> Tuple[str, float]:
y = dtrain.get_label().reshape(predt.shape)
v = np.sqrt(np.sum(np.power(y - predt, 2)))
return "PyRMSE", v
X, y = gen_circle()
Xy = xgb.DMatrix(X, y)
results: Dict[str, Dict[str, List[float]]] = {}
# Make sure the `num_target` is passed to XGBoost when custom objective is used.
# When builtin objective is used, XGBoost can figure out the number of targets
# automatically.
booster = xgb.train(
{
"tree_method": "hist",
"num_target": y.shape[1],
},
dtrain=Xy,
num_boost_round=100,
obj=squared_log,
evals=[(Xy, "Train")],
evals_result=results,
custom_metric=rmse,
)
y_predt = booster.inplace_predict(X)
if plot_result:
plot_predt(y, y_predt, "multi")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--plot", choices=[0, 1], type=int, default=1)
args = parser.parse_args()
# Train with builtin RMSE objective
rmse_model(args.plot == 1)
# Train with custom objective.
custom_rmse_model(args.plot == 1)

View File

@@ -1,7 +1,3 @@
"""
Demo for prediction using number of trees
=========================================
"""
import os
import numpy as np
import xgboost as xgb

View File

@@ -1,7 +1,3 @@
"""
Demo for obtaining leaf index
=============================
"""
import os
import xgboost as xgb
@@ -16,9 +12,7 @@ bst = xgb.train(param, dtrain, num_round, watchlist)
print('start testing predict the leaf indices')
# predict using first 2 tree
leafindex = bst.predict(
dtest, iteration_range=(0, 2), pred_leaf=True, strict_shape=True
)
leafindex = bst.predict(dtest, ntree_limit=2, pred_leaf=True)
print(leafindex.shape)
print(leafindex)
# predict all trees

View File

@@ -1,6 +1,4 @@
'''
Demo for using data iterator with Quantile DMatrix
==================================================
'''A demo for defining data iterator.
.. versionadded:: 1.2.0

View File

@@ -1,7 +1,6 @@
"""
Demo for accessing the xgboost eval metrics by using sklearn interface
======================================================================
"""
##
# This script demonstrate how to access the xgboost eval metrics by using sklearn
##
import xgboost as xgb
import numpy as np

View File

@@ -1,7 +1,4 @@
'''
Collection of examples for using sklearn interface
==================================================
Created on 1 Apr 2015
@author: Jamie Hall
@@ -12,7 +9,7 @@ import xgboost as xgb
import numpy as np
from sklearn.model_selection import KFold, train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, mean_squared_error
from sklearn.datasets import load_iris, load_digits, fetch_california_housing
from sklearn.datasets import load_iris, load_digits, load_boston
rng = np.random.RandomState(31337)
@@ -38,8 +35,10 @@ for train_index, test_index in kf.split(X):
actuals = y[test_index]
print(confusion_matrix(actuals, predictions))
print("California Housing: regression")
X, y = fetch_california_housing(return_X_y=True)
print("Boston Housing: regression")
boston = load_boston()
y = boston['target']
X = boston['data']
kf = KFold(n_splits=2, shuffle=True, random_state=rng)
for train_index, test_index in kf.split(X):
xgb_model = xgb.XGBRegressor(n_jobs=1).fit(X[train_index], y[train_index])
@@ -48,6 +47,8 @@ for train_index, test_index in kf.split(X):
print(mean_squared_error(actuals, predictions))
print("Parameter optimization")
y = boston['target']
X = boston['data']
xgb_model = xgb.XGBRegressor(n_jobs=1)
clf = GridSearchCV(xgb_model,
{'max_depth': [2, 4, 6],
@@ -59,8 +60,8 @@ print(clf.best_params_)
# The sklearn API models are picklable
print("Pickling sklearn API models")
# must open in binary format to pickle
pickle.dump(clf, open("best_calif.pkl", "wb"))
clf2 = pickle.load(open("best_calif.pkl", "rb"))
pickle.dump(clf, open("best_boston.pkl", "wb"))
clf2 = pickle.load(open("best_boston.pkl", "rb"))
print(np.allclose(clf.predict(X), clf2.predict(X)))
# Early-stopping

View File

@@ -1,15 +1,14 @@
"""
Demo for using xgboost with sklearn
===================================
"""
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import fetch_california_housing
from sklearn.datasets import load_boston
import xgboost as xgb
import multiprocessing
if __name__ == "__main__":
print("Parallel Parameter optimization")
X, y = fetch_california_housing(return_X_y=True)
boston = load_boston()
y = boston['target']
X = boston['data']
xgb_model = xgb.XGBRegressor(n_jobs=multiprocessing.cpu_count() // 2)
clf = GridSearchCV(xgb_model, {'max_depth': [2, 4, 6],
'n_estimators': [50, 100, 200]}, verbose=1,

View File

@@ -1,21 +1,17 @@
"""
Demo for using `process_type` with `prune` and `refresh`
========================================================
Modifying existing trees is not a well established use for XGBoost, so feel free to
experiment.
"""Demo for using `process_type` with `prune` and `refresh`. Modifying existing trees is
not a well established use for XGBoost, so feel free to experiment.
"""
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.datasets import load_boston
import numpy as np
def main():
n_rounds = 32
X, y = fetch_california_housing(return_X_y=True)
X, y = load_boston(return_X_y=True)
# Train a model first
X_train = X[: X.shape[0] // 2]

View File

@@ -1,102 +0,0 @@
import os
import sys
import errno
import subprocess
import glob
import shutil
from contextlib import contextmanager
def normpath(path):
"""Normalize UNIX path to a native path."""
normalized = os.path.join(*path.split("/"))
if os.path.isabs(path):
return os.path.abspath("/") + normalized
else:
return normalized
def cp(source, target):
source = normpath(source)
target = normpath(target)
print("cp {0} {1}".format(source, target))
shutil.copy(source, target)
def maybe_makedirs(path):
path = normpath(path)
print("mkdir -p " + path)
try:
os.makedirs(path)
except OSError as e:
if e.errno != errno.EEXIST:
raise
@contextmanager
def cd(path):
path = normpath(path)
cwd = os.getcwd()
os.chdir(path)
print("cd " + path)
try:
yield path
finally:
os.chdir(cwd)
def run(command, **kwargs):
print(command)
subprocess.check_call(command, shell=True, **kwargs)
def main():
with cd("jvm-packages/"):
print("====copying pure-Python tracker====")
for use_cuda in [True, False]:
xgboost4j = "xgboost4j-gpu" if use_cuda else "xgboost4j"
cp("../python-package/xgboost/tracker.py", f"{xgboost4j}/src/main/resources")
print("====copying resources for testing====")
with cd("../demo/CLI/regression"):
run(f"{sys.executable} mapfeat.py")
run(f"{sys.executable} mknfold.py machine.txt 1")
for use_cuda in [True, False]:
xgboost4j = "xgboost4j-gpu" if use_cuda else "xgboost4j"
xgboost4j_spark = "xgboost4j-spark-gpu" if use_cuda else "xgboost4j-spark"
maybe_makedirs(f"{xgboost4j}/src/test/resources")
maybe_makedirs(f"{xgboost4j_spark}/src/test/resources")
for file in glob.glob("../demo/data/agaricus.*"):
cp(file, f"{xgboost4j}/src/test/resources")
cp(file, f"{xgboost4j_spark}/src/test/resources")
for file in glob.glob("../demo/CLI/regression/machine.txt.t*"):
cp(file, f"{xgboost4j_spark}/src/test/resources")
print("====Creating directories to hold native binaries====")
for os, arch in [("linux", "x86_64"), ("windows", "x86_64"), ("macos", "x86_64")]:
output_dir = f"xgboost4j/src/main/resources/lib/{os}/{arch}"
maybe_makedirs(output_dir)
for os, arch in [("linux", "x86_64")]:
output_dir = f"xgboost4j-gpu/src/main/resources/lib/{os}/{arch}"
maybe_makedirs(output_dir)
print("====Next Steps====")
print("1. Gain upload right to Maven Central repo.")
print("1-1. Sign up for a JIRA account at Sonatype: ")
print("1-2. File a JIRA ticket: "
"https://issues.sonatype.org/secure/CreateIssue.jspa?issuetype=21&pid=10134. Example: "
"https://issues.sonatype.org/browse/OSSRH-67724")
print("2. Store the Sonatype credentials in .m2/settings.xml. See insturctions in "
"https://central.sonatype.org/publish/publish-maven/")
print("3. Obtain Linux and Windows binaries from the CI server")
print("3-1. Get xgboost4j_[commit].dll from "
"https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/list.html. Rename it to"
"xgboost4j.dll.")
print("3-2. For Linux binaries, go to "
"https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/list.html and navigate to the "
"release/ directory. Find and download two JAR files: xgboost4j_2.12-[version].jar and "
"xgboost4j-gpu_2.12-[version].jar. Use unzip command to extract libxgboost4j.so (one "
"version compiled with GPU support and another compiled without).")
print("4. Put the binaries in xgboost4j(-gpu)/src/main/resources/lib/[os]/[arch]")
print("5. Now on a Mac machine, run:")
print(" GPG_TTY=$(tty) mvn deploy -Prelease -DskipTests")
print("6. Log into https://oss.sonatype.org/. On the left menu panel, click Staging "
"Repositories. Visit the URL https://oss.sonatype.org/content/repositories/mldmlc-1085 "
"to inspect the staged JAR files. Finally, press Release button to publish the "
"artifacts to the Maven Central repository.")
if __name__ == "__main__":
main()

View File

@@ -3,7 +3,6 @@
tqdm, sh are required to run this script.
"""
from urllib.request import urlretrieve
from typing import cast, Tuple
import argparse
from typing import List
from sh.contrib import git
@@ -34,7 +33,6 @@ def show_progress(block_num, block_size, total_size):
def retrieve(url, filename=None):
print(f"{url} -> {filename}")
return urlretrieve(url, filename, reporthook=show_progress)
@@ -51,7 +49,7 @@ def download_wheels(
dir_URL: str,
src_filename_prefix: str,
target_filename_prefix: str,
) -> List[str]:
) -> List:
"""Download all binary wheels. dir_URL is the URL for remote directory storing the release
wheels
@@ -65,6 +63,7 @@ def download_wheels(
target_wheel = target_filename_prefix + platform + ".whl"
filename = os.path.join(DIST, target_wheel)
filenames.append(filename)
print("Downloading from:", url, "to:", filename)
retrieve(url=url, filename=filename)
ret = subprocess.run(["twine", "check", filename], capture_output=True)
assert ret.returncode == 0, "Failed twine check"
@@ -75,14 +74,29 @@ def download_wheels(
return filenames
def download_py_packages(major: int, minor: int, commit_hash: str):
def check_path():
root = os.path.abspath(os.path.curdir)
assert os.path.basename(root) == "xgboost", "Must be run on project root."
def main(args: argparse.Namespace) -> None:
check_path()
rel = version.StrictVersion(args.release)
platforms = [
"win_amd64",
"manylinux2014_x86_64",
"manylinux2014_aarch64",
"macosx_10_15_x86_64.macosx_11_0_x86_64.macosx_12_0_x86_64",
"macosx_12_0_arm64"
"macosx_10_14_x86_64.macosx_10_15_x86_64.macosx_11_0_x86_64",
]
print("Release:", rel)
major, minor, patch = rel.version
branch = "release_" + str(major) + "." + str(minor) + ".0"
git.clean("-xdf")
git.checkout(branch)
git.pull("origin", branch)
git.submodule("update")
commit_hash = latest_hash()
dir_URL = PREFIX + str(major) + "." + str(minor) + ".0" + "/"
src_filename_prefix = "xgboost-" + args.release + "%2B" + commit_hash + "-py3-none-"
@@ -105,74 +119,10 @@ Following steps should be done manually:
)
def download_r_packages(release: str, rc: str, commit: str) -> None:
platforms = ["win64", "linux"]
dirname = "./r-packages"
if not os.path.exists(dirname):
os.mkdir(dirname)
filenames = []
for plat in platforms:
url = f"{PREFIX}{release}/xgboost_r_gpu_{plat}_{commit}.tar.gz"
if not rc:
filename = f"xgboost_r_gpu_{plat}_{release}.tar.gz"
else:
filename = f"xgboost_r_gpu_{plat}_{release}-{rc}.tar.gz"
target = os.path.join(dirname, filename)
retrieve(url=url, filename=target)
filenames.append(target)
print("Finished downloading R packages:", filenames)
def check_path():
root = os.path.abspath(os.path.curdir)
assert os.path.basename(root) == "xgboost", "Must be run on project root."
def main(args: argparse.Namespace) -> None:
check_path()
rel = version.LooseVersion(args.release)
print("Release:", rel)
if len(rel.version) == 3:
# Major release
major, minor, patch = version.StrictVersion(args.release).version
rc = None
rc_ver = None
else:
# RC release
major, minor, patch, rc, rc_ver = cast(
Tuple[int, int, int, str, int], rel.version
)
assert rc == "rc"
release = str(major) + "." + str(minor) + "." + str(patch)
branch = "release_" + release
git.clean("-xdf")
git.checkout(branch)
git.pull("origin", branch)
git.submodule("update")
commit_hash = latest_hash()
download_r_packages(
release, "" if rc is None else rc + str(rc_ver), commit_hash
)
download_py_packages(major, minor, commit_hash)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--release",
type=str,
required=True,
help="Version tag, e.g. '1.3.2', or '1.5.0rc1'"
"--release", type=str, required=True, help="Version tag, e.g. '1.3.2'."
)
args = parser.parse_args()
main(args)

View File

@@ -1,31 +1,23 @@
@import url('theme.css');
/* Logo background */
.wy-side-nav-search, .wy-side-nav-search img {
background-color: #ffffff !important;
div.breathe-sectiondef.container {
width: 100%;
}
.highlight {
background: #f1f3f4;
div.literal-block-wrapper.container {
width: 100%;
}
.navbar {
background: #ffffff;
.red {
color: red;
}
.navbar-nav {
background: #ffffff;
table {
border: 0;
}
/* side bar */
.wy-nav-side {
background: #f1f3f4;
}
.wy-menu-vertical a {
color: #707070;
}
.wy-side-nav-search div.version {
color: #404040;
td, th {
padding: 1px 8px 1px 5px;
border-top: 0;
border-bottom: 1px solid #aaa;
border-left: 0;
border-right: 0;
}

View File

@@ -13,8 +13,6 @@ systems. If the instructions do not work for you, please feel free to ask quest
.. contents:: Contents
.. _get_source:
*************************
Obtaining the Source Code
*************************
@@ -54,7 +52,7 @@ This shared library is used by different language bindings (with some additions
on the binding you choose). The minimal building requirement is
- A recent C++ compiler supporting C++11 (g++-5.0 or higher)
- CMake 3.14 or higher.
- CMake 3.13 or higher.
For a list of CMake options like GPU support, see ``#-- Options`` in CMakeLists.txt on top
level of source tree.
@@ -81,11 +79,33 @@ Obtain ``libomp`` from `Homebrew <https://brew.sh/>`_:
brew install libomp
Rest is the same as building on Linux.
Now clone the repository:
.. code-block:: bash
git clone --recursive https://github.com/dmlc/xgboost
Create the ``build/`` directory and invoke CMake. After invoking CMake, you can build XGBoost with ``make``:
.. code-block:: bash
mkdir build
cd build
cmake ..
make -j4
You may now continue to :ref:`build_python`.
Building on Windows
===================
You need to first clone the XGBoost repo with ``--recursive`` option, to clone the submodules.
We recommend you use `Git for Windows <https://git-for-windows.github.io/>`_, as it comes with a standard Bash shell. This will highly ease the installation process.
.. code-block:: bash
git submodule init
git submodule update
XGBoost support compilation with Microsoft Visual Studio and MinGW. To build with Visual
Studio, we will need CMake. Make sure to install a recent version of CMake. Then run the
@@ -157,6 +177,14 @@ On Windows, run CMake as follows:
(Change the ``-G`` option appropriately if you have a different version of Visual Studio installed.)
.. note:: Visual Studio 2017 Win64 Generator may not work
Choosing the Visual Studio 2017 generator may cause compilation failure. When it happens, specify the 2015 compiler by adding the ``-T`` option:
.. code-block:: bash
cmake .. -G"Visual Studio 15 2017 Win64" -T v140,cuda=8.0 -DUSE_CUDA=ON
The above cmake configuration run will create an ``xgboost.sln`` solution file in the build directory. Build this solution in release mode as a x64 build, either from Visual studio or from command line:
.. code-block:: bash
@@ -300,9 +328,9 @@ So you may want to build XGBoost with GCC own your own risk. This presents some
4. Don't use ``-march=native`` gcc flag. Using it causes the Python interpreter to crash if the DLL was actually used.
5. You may need to provide the lib with the runtime libs. If ``mingw32/bin`` is not in ``PATH``, build a wheel (``python setup.py bdist_wheel``), open it with an archiver and put the needed dlls to the directory where ``xgboost.dll`` is situated. Then you can install the wheel with ``pip``.
******************************
Building R Package From Source
******************************
*******************************
Building R Package From Source.
*******************************
By default, the package installed by running ``install.packages`` is built from source.
Here we list some other options for installing development version.
@@ -313,28 +341,23 @@ Installing the development version (Linux / Mac OSX)
Make sure you have installed git and a recent C++ compiler supporting C++11 (See above
sections for requirements of building C++ core).
Due to the use of git-submodules, ``devtools::install_github`` can no longer be used to
install the latest version of R package. Thus, one has to run git to check out the code
first, see :ref:`get_source` on how to initialize the git repository for XGBoost. The
simplest way to install the R package after obtaining the source code is:
.. code-block:: bash
cd R-package
R CMD INSTALL .
But if you want to use CMake build for better performance (which has the logic for
detecting available CPU instructions) or greater flexibility around compile flags, the
above snippet can be replaced by:
Due to the use of git-submodules, ``devtools::install_github`` can no longer be used to install the latest version of R package.
Thus, one has to run git to check out the code first:
.. code-block:: bash
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
git submodule init
git submodule update
mkdir build
cd build
cmake .. -DR_LIB=ON
make -j$(nproc)
make install
If all fails, try `Building the shared library`_ to see whether a problem is specific to R
package or not. Notice that the R package is installed by CMake directly.
Installing the development version with Visual Studio (Windows)
===============================================================
@@ -500,7 +523,14 @@ XGBoost uses `Sphinx <https://www.sphinx-doc.org/en/stable/>`_ for documentation
* Python dependencies
Checkout the ``requirements.txt`` file under ``doc/``
- sphinx
- breathe
- guzzle_sphinx_theme
- recommonmark
- mock
- sh
- graphviz
- matplotlib
Under ``xgboost/doc`` directory, run ``make <format>`` with ``<format>`` replaced by the format you want. For a list of supported formats, run ``make help`` under the same directory.

View File

@@ -2,11 +2,11 @@
XGBoost C Package
#################
XGBoost implements a set of C API designed for various bindings, we maintain its stability
and the CMake/make build interface. See :doc:`/tutorials/c_api_tutorial` for an
introduction and ``demo/c-api/`` for related examples. Also one can generate doxygen
document by providing ``-DBUILD_C_DOC=ON`` as parameter to ``CMake`` during build, or
simply look at function comments in ``include/xgboost/c_api.h``.
XGBoost implements a set of C API designed for various bindings, we maintain its
stability and the CMake/make build interface. See ``demo/c-api/README.md`` for an
overview and related examples. Also one can generate doxygen document by providing
``-DBUILD_C_DOC=ON`` as parameter to ``CMake`` during build, or simply look at function
comments in ``include/xgboost/c_api.h``.
* `C API documentation (latest master branch) <https://xgboost.readthedocs.io/en/latest/dev/c__api_8h.html>`_
* `C API documentation (last stable release) <https://xgboost.readthedocs.io/en/stable/dev/c__api_8h.html>`_

View File

@@ -19,6 +19,7 @@ import sys
import re
import os
import subprocess
import guzzle_sphinx_theme
git_branch = os.getenv('SPHINX_GIT_BRANCH', default=None)
if not git_branch:
@@ -32,7 +33,6 @@ if not git_branch:
else:
git_branch = [git_branch]
print('git_branch = {}'.format(git_branch[0]))
try:
filename, _ = urllib.request.urlretrieve(
'https://s3-us-west-2.amazonaws.com/xgboost-docs/{}.tar.bz2'.format(
@@ -62,6 +62,12 @@ libpath = os.path.join(curr_path, '../python-package/')
sys.path.insert(0, libpath)
sys.path.insert(0, curr_path)
# -- mock out modules
import mock # NOQA
MOCK_MODULES = ['scipy', 'scipy.sparse', 'sklearn', 'pandas']
for mod_name in MOCK_MODULES:
sys.modules[mod_name] = mock.Mock()
# -- General configuration ------------------------------------------------
# General information about the project.
@@ -84,19 +90,10 @@ extensions = [
'sphinx.ext.napoleon',
'sphinx.ext.mathjax',
'sphinx.ext.intersphinx',
"sphinx_gallery.gen_gallery",
'breathe',
'recommonmark'
]
sphinx_gallery_conf = {
# path to your example scripts
"examples_dirs": ["../demo/guide-python", "../demo/dask"],
# path to where to save gallery generated output
"gallery_dirs": ["python/examples", "python/dask-examples"],
"matplotlib_animations": True,
}
autodoc_typehints = "description"
graphviz_output_format = 'png'
@@ -172,13 +169,17 @@ todo_include_todos = False
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
html_theme = "sphinx_rtd_theme"
html_theme_options = {"logo_only": True}
html_theme_path = guzzle_sphinx_theme.html_theme_path()
html_theme = 'guzzle_sphinx_theme'
# Register the theme as an extension to generate a sitemap.xml
extensions.append("guzzle_sphinx_theme")
html_logo = "https://raw.githubusercontent.com/dmlc/dmlc.github.io/master/img/logo-m/xgboost.png"
html_css_files = ["css/custom.css"]
# Guzzle theme options (see theme.conf for more information)
html_theme_options = {
# Set the name of the project to appear in the sidebar
"project_nav_name": "XGBoost"
}
html_sidebars = {
'**': ['logo-text.html', 'globaltoc.html', 'searchbox.html']
@@ -200,17 +201,16 @@ latex_elements = {
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(master_doc, '%s.tex' % project, project, author, 'manual'),
(master_doc, '%s.tex' % project, project,
author, 'manual'),
]
intersphinx_mapping = {
"python": ("https://docs.python.org/3.6", None),
"numpy": ("https://docs.scipy.org/doc/numpy/", None),
"scipy": ("https://docs.scipy.org/doc/scipy/reference/", None),
"pandas": ("http://pandas-docs.github.io/pandas-docs-travis/", None),
"sklearn": ("https://scikit-learn.org/stable", None),
"dask": ("https://docs.dask.org/en/stable/", None),
"distributed": ("https://distributed.dask.org/en/stable/", None),
'python': ('https://docs.python.org/3.6', None),
'numpy': ('http://docs.scipy.org/doc/numpy/', None),
'scipy': ('http://docs.scipy.org/doc/scipy/reference/', None),
'pandas': ('http://pandas-docs.github.io/pandas-docs-travis/', None),
'sklearn': ('http://scikit-learn.org/stable', None)
}

View File

@@ -25,15 +25,3 @@ requests and every update to branches. A few tests however require manual activa
details about noLD. This is a requirement for keeping XGBoost on CRAN (the R package index).
To invoke this test suite for a particular pull request, simply add a review comment
``/gha run r-nold-test``. (Ordinary comment won't work. It needs to be a review comment.)
GitHub Actions is also used to build Python wheels targeting MacOS Intel and Apple Silicon. See
`.github/workflows/python_wheels.yml
<https://github.com/dmlc/xgboost/tree/master/.github/workflows/python_wheels.yml>`_. The
``python_wheels`` pipeline sets up environment variables prefixed ``CIBW_*`` to indicate the target
OS and processor. The pipeline then invokes the script ``build_python_wheels.sh``, which in turns
calls ``cibuildwheel`` to build the wheel. The ``cibuildwheel`` is a library that sets up a
suitable Python environment for each OS and processor target. Since we don't have Apple Silion
machine in GitHub Actions, cross-compilation is needed; ``cibuildwheel`` takes care of the complex
task of cross-compiling a Python wheel. (Note that ``cibuildwheel`` will call
``setup.py bdist_wheel``. Since XGBoost has a native library component, ``setup.py`` contains
a glue code to call CMake and a C++ compiler to build the native library on the fly.)

View File

@@ -134,49 +134,3 @@ Similarly, if you want to exclude C++ source from linting:
cd /path/to/xgboost/
python3 tests/ci_build/tidy.py --cpp=0
**********************************
Guide for handling user input data
**********************************
This is an in-comprehensive guide for handling user input data. XGBoost has wide verity
of native supported data structures, mostly come from higher level language bindings. The
inputs ranges from basic contiguous 1 dimension memory buffer to more sophisticated data
structures like columnar data with validity mask. Raw input data can be used in 2 places,
firstly it's the construction of various ``DMatrix``, secondly it's the in-place
prediction. For plain memory buffer, there's not much to discuss since it's just a
pointer with a size. But for general n-dimension array and columnar data, there are many
subtleties. XGBoost has 3 different data structures for handling optionally masked arrays
(tensors), for consuming user inputs ``ArrayInterface`` should be chosen. There are many
existing functions that accept only plain pointer due to legacy reasons (XGBoost started
as a much simpler library and didn't care about memory usage that much back then). The
``ArrayInterface`` is a in memory representation of ``__array_interface__`` protocol
defined by numpy or the ``__cuda_array_interface__`` defined by numba. Following is a
check list of things to have in mind when accepting related user inputs:
- [ ] Is it strided? (identified by the ``strides`` field)
- [ ] If it's a vector, is it row vector or column vector? (Identified by both ``shape``
and ``strides``).
- [ ] Is the data type supported? Half type and 128 integer types should be converted
before going into XGBoost.
- [ ] Does it have higher than 1 dimension? (identified by ``shape`` field)
- [ ] Are some of dimensions trivial? (shape[dim] <= 1)
- [ ] Does it have mask? (identified by ``mask`` field)
- [ ] Can the mask be broadcasted? (unsupported at the moment)
- [ ] Is it on CUDA memory? (identified by ``data`` field, and optionally ``stream``)
Most of the checks are handled by the ``ArrayInterface`` during construction, except for
the data type issue since it doesn't know how to cast such pointers with C builtin types.
But for safety reason one should still try to write related tests for the all items. The
data type issue should be taken care of in language binding for each of the specific data
input. For single-chunk columnar format, it's just a masked array for each column so it
should be treated uniformly as normal array. For input predictor ``X``, we have adapters
for each type of input. Some are composition of the others. For instance, CSR matrix has 3
potentially strided arrays for ``indptr``, ``indices`` and ``values``. No assumption
should be made to these components (all the check boxes should be considered). Slicing row
of CSR matrix should calculate the offset of each field based on respective strides.
For meta info like labels, which is growing both in size and complexity, we accept only
masked array at the moment (no specialized adapter). One should be careful about the
input data shape. For base margin it can be 2 dim or higher if we have multiple targets in
the future. The getters in ``DMatrix`` returns only 1 dimension flatten vectors at the
moment, which can be improved in the future when it's needed.

View File

@@ -19,7 +19,7 @@ Documents
make html
inside the ``doc/`` directory. The online document is hosted by `Read the Docs <https://readthedocs.org/>`__ where the imported project is managed by `Hyunsu Cho <https://github.com/hcho3>`__ and `Jiaming Yuan <https://github.com/trivialfis>`__.
inside the ``doc/`` directory.
********
Examples

View File

@@ -18,22 +18,13 @@ Making a Release
1. Create an issue for the release, noting the estimated date and expected features or major fixes, pin that issue.
2. Bump release version.
1. Modify ``CMakeLists.txt`` in source tree and ``cmake/Python_version.in`` if needed, run CMake.
2. Modify ``DESCRIPTION`` in R-package.
3. Run ``change_version.sh`` in ``jvm-packages/dev``
3. Commit the change, create a PR on GitHub on release branch. Port the bumped version to default branch, optionally with the postfix ``SNAPSHOT``.
4. Create a tag on release branch, either on GitHub or locally.
5. Make a release on GitHub tag page, which might be done with previous step if the tag is created on GitHub.
6. Submit pip, CRAN, and Maven packages.
+ The pip package is maintained by `Hyunsu Cho <https://github.com/hcho3>`__ and `Jiaming Yuan <https://github.com/trivialfis>`__. There's a helper script for downloading pre-built wheels and R packages ``xgboost/dev/release-pypi-r.py`` along with simple instructions for using ``twine``.
+ The CRAN package is maintained by `Tong He <https://github.com/hetong007>`_ and `Jiaming Yuan <https://github.com/trivialfis>`__.
Before submitting a release, one should test the package on `R-hub <https://builder.r-hub.io/>`__ and `win-builder <https://win-builder.r-project.org/>`__ first. Please note that the R-hub Windows instance doesn't have the exact same environment as the one hosted on win-builder.
+ The Maven package is maintained by `Nan Zhu <https://github.com/CodingCat>`_ and `Hyunsu Cho <https://github.com/hcho3>`_.
- The pip package is maintained by [Hyunsu Cho](http://hyunsu-cho.io/) and [Jiaming Yuan](https://github.com/trivialfis). There's a helper script for downloading pre-built wheels on ``xgboost/dev/release-pypi.py`` along with simple instructions for using ``twine``.
- The CRAN package is maintained by [Tong He](https://github.com/hetong007).
- The Maven package is maintained by [Nan Zhu](https://github.com/CodingCat).

View File

@@ -4,10 +4,10 @@ XGBoost GPU Support
This page contains information about GPU algorithms supported in XGBoost.
.. note:: CUDA 10.1, Compute Capability 3.5 required
.. note:: CUDA 10.0, Compute Capability 3.5 required
The GPU algorithms in XGBoost require a graphics card with compute capability 3.5 or higher, with
CUDA toolkits 10.1 or later.
CUDA toolkits 10.0 or later.
(See `this list <https://en.wikipedia.org/wiki/CUDA#GPUs_supported>`_ to look up compute capability of your GPU card.)
*********************************************
@@ -179,7 +179,7 @@ Following table shows current support status for evaluation metrics on the GPU.
+------------------------------+-------------+
| auc | |tick| |
+------------------------------+-------------+
| aucpr | |tick| |
| aucpr | |cross| |
+------------------------------+-------------+
| ndcg | |tick| |
+------------------------------+-------------+
@@ -224,19 +224,25 @@ Training time on 1,000,000 rows x 50 columns of random data with 500 boosting it
Memory usage
============
The following are some guidelines on the device memory usage of the `gpu_hist` tree method.
The following are some guidelines on the device memory usage of the `gpu_hist` updater.
If you train xgboost in a loop you may notice xgboost is not freeing device memory after each training iteration. This is because memory is allocated over the lifetime of the booster object and does not get freed until the booster is freed. A workaround is to serialise the booster object after training. See `demo/gpu_acceleration/memory.py` for a simple example.
Memory inside xgboost training is generally allocated for two reasons - storing the dataset and working memory.
The dataset itself is stored on device in a compressed ELLPACK format. The ELLPACK format is a type of sparse matrix that stores elements with a constant row stride. This format is convenient for parallel computation when compared to CSR because the row index of each element is known directly from its address in memory. The disadvantage of the ELLPACK format is that it becomes less memory efficient if the maximum row length is significantly more than the average row length. Elements are quantised and stored as integers. These integers are compressed to a minimum bit length. Depending on the number of features, we usually don't need the full range of a 32 bit integer to store elements and so compress this down. The compressed, quantised ELLPACK format will commonly use 1/4 the space of a CSR matrix stored in floating point.
In some cases the full CSR matrix stored in floating point needs to be allocated on the device. This currently occurs for prediction in multiclass classification. If this is a problem consider setting `'predictor'='cpu_predictor'`. This also occurs when the external data itself comes from a source on device e.g. a cudf DataFrame. These are known issues we hope to resolve.
Working memory is allocated inside the algorithm proportional to the number of rows to keep track of gradients, tree positions and other per row statistics. Memory is allocated for histogram bins proportional to the number of bins, number of features and nodes in the tree. For performance reasons we keep histograms in memory from previous nodes in the tree, when a certain threshold of memory usage is passed we stop doing this to conserve memory at some performance loss.
If you are getting out-of-memory errors on a big dataset, try the or :py:class:`xgboost.DeviceQuantileDMatrix` or :doc:`external memory version </tutorials/external_memory>`.
The quantile finding algorithm also uses some amount of working device memory. It is able to operate in batches, but is not currently well optimised for sparse data.
If you are getting out-of-memory errors on a big dataset, try the :doc:`external memory version </tutorials/external_memory>`.
Developer notes
===============
The application may be profiled with annotations by specifying USE_NTVX to cmake. Regions covered by the 'Monitor' class in CUDA code will automatically appear in the nsight profiler when `verbosity` is set to 3.
The application may be profiled with annotations by specifying USE_NTVX to cmake and providing the path to the stand-alone nvtx header via NVTX_HEADER_DIR. Regions covered by the 'Monitor' class in CUDA code will automatically appear in the nsight profiler.
**********
References

View File

@@ -16,7 +16,7 @@ Stable Release
Python
------
Pre-built binary are uploaded to PyPI (Python Package Index) for each release. Supported platforms are Linux (x86_64, aarch64), Windows (x86_64) and MacOS (x86_64, Apple Silicon).
Pre-built binary are uploaded to PyPI (Python Package Index) for each release. Supported platforms are Linux (x86_64, aarch64), Windows (x86_64) and MacOS (x86_64).
.. code-block:: bash
@@ -29,40 +29,17 @@ into permission errors. Python pre-built binary capability for each platform:
.. |tick| unicode:: U+2714
.. |cross| unicode:: U+2718
+---------------------+---------+----------------------+
| Platform | GPU | Multi-Node-Multi-GPU |
+=====================+=========+======================+
| Linux x86_64 | |tick| | |tick| |
+---------------------+---------+----------------------+
| Linux aarch64 | |cross| | |cross| |
+---------------------+---------+----------------------+
| MacOS x86_64 | |cross| | |cross| |
+---------------------+---------+----------------------+
| MacOS Apple Silicon | |cross| | |cross| |
+---------------------+---------+----------------------+
| Windows | |tick| | |cross| |
+---------------------+---------+----------------------+
Conda
*****
You may use the Conda packaging manager to install XGBoost:
.. code-block:: bash
conda install -c conda-forge py-xgboost
Conda should be able to detect the existence of a GPU on your machine and install the correct variant of XGBoost. If you run into issues, try indicating the variant explicitly:
.. code-block:: bash
# CPU only
conda install -c conda-forge py-xgboost-cpu
# Use NVIDIA GPU
conda install -c conda-forge py-xgboost-gpu
Visit the `Miniconda website <https://docs.conda.io/en/latest/miniconda.html>`_ to obtain Conda.
+-------------------+---------+----------------------+
| Platform | GPU | Multi-Node-Multi-GPU |
+===================+=========+======================+
| Linux x86_64 | |tick| | |tick| |
+-------------------+---------+----------------------+
| Linux aarch64 | |cross| | |cross| |
+-------------------+---------+----------------------+
| MacOS | |cross| | |cross| |
+-------------------+---------+----------------------+
| Windows | |tick| | |cross| |
+-------------------+---------+----------------------+
R
-

View File

@@ -45,7 +45,7 @@ Installation from maven repo
.. note:: Use of Python in XGBoost4J-Spark
By default, we use the tracker in `Python package <https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/tracker.py>`_ to drive the training with XGBoost4J-Spark. It requires Python 3.6+. We also have an experimental Scala version of tracker which can be enabled by passing the parameter ``tracker_conf`` as ``scala``.
By default, we use the tracker in `dmlc-core <https://github.com/dmlc/dmlc-core/tree/master/tracker>`_ to drive the training with XGBoost4J-Spark. It requires Python 2.7+. We also have an experimental Scala version of tracker which can be enabled by passing the parameter ``tracker_conf`` as ``scala``.
Data Preparation
================

View File

@@ -97,7 +97,7 @@
"default_left": {
"type": "array",
"items": {
"type": "integer"
"type": "boolean"
}
},
"categories": {
@@ -168,9 +168,6 @@
"num_trees": {
"type": "string"
},
"num_parallel_tree": {
"type": "string"
},
"size_leaf_vector": {
"type": "string"
}
@@ -207,14 +204,6 @@
}
}
},
"pseduo_huber_param": {
"type": "object",
"properties": {
"huber_slope": {
"type": "string"
}
}
},
"aft_loss_param": {
"type": "object",
"properties": {

Some files were not shown because too many files have changed in this diff Show More