* [CI] Use Vault repository to regain access to devtoolset-4
* Use manylinux2010 tag
* Update Dockerfile.jvm
* Fix rename_whl.py
* Upgrade Pip, to handle manylinux2010 tag
* Update insert_vcomp140.py
* Update test_python.sh
* Set output margin to True for custom objective in Python and R.
* Add a demo for writing multi-class custom objective function.
* Run tests on selected demos.
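A minimal sketch of what such a multi-class custom objective can look like in Python, assuming the booster now hands raw (untransformed) margins to the objective; the helper name `softprob_obj` is illustrative, not the demo's actual code:
```python
import numpy as np
import xgboost as xgb

kClasses = 3

def softprob_obj(predt, dtrain):
    # With output margin enabled, the booster passes raw margin scores,
    # so the objective applies softmax itself.
    labels = dtrain.get_label().astype(int)
    predt = predt.reshape(-1, kClasses)           # margins may arrive flattened
    e = np.exp(predt - predt.max(axis=1, keepdims=True))
    prob = e / e.sum(axis=1, keepdims=True)       # row-wise softmax
    onehot = np.eye(kClasses)[labels]
    grad = prob - onehot                          # d(loss)/d(margin)
    hess = np.maximum(2.0 * prob * (1.0 - prob), 1e-6)
    return grad.flatten(), hess.flatten()

X = np.random.randn(100, 4)
y = np.random.randint(0, kClasses, size=100)
dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({'num_class': kClasses, 'disable_default_eval_metric': 1},
                dtrain, num_boost_round=10, obj=softprob_obj)
```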
* Group aware GPU weighted sketching.
* Distribute group weights to each data point.
* Relax the test.
* Validate input meta info.
* Fix metainfo copy ctor.
* Add inplace prediction for dask-cudf.
* Remove Dockerfile.release, since it's not used anywhere
* Use Conda exclusively in CUDF and GPU containers
* Improve cupy memory copying.
* Add skip marks to tests.
* Add mgpu-cudf category on the CI to run all distributed tests.
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
* Ensure that configured header (build_config.h) from dmlc-core is picked up by Rabit and XGBoost
* Check which Rabit target is being used
* Use CMake 3.13 in all Jenkins tests
* Upgrade CMake in Travis CI
* Install CMake using Kitware installer
* Remove existing CMake (3.12.4)
* Use devtoolset-6.
* [CI] Use devtoolset-6 because devtoolset-4 is EOL and no longer available
* CUDA 9.0 doesn't work with devtoolset-6; use devtoolset-4 for GPU build only
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
* Add bindings for serialization.
* Change `xgb.save.raw` to perform full serialization instead of saving only the simple model.
* Add `xgb.load.raw` for deserialization.
* Run devtools.
* fix type error
* Validate number of features.
* resolve comments
* add feature size for LabelPoint and DataBatch
* pass the feature size to native
* move feature size validating tests into a separate suite
* resolve comments
Co-authored-by: fis <jm.yuan@outlook.com>
* Robust regularization of AFT gradient and hessian
* Fix AFT doc; expose it to tutorial TOC
* Apply robust regularization to uncensored case too
* Revise unit test slightly
* Fix lint
* Update test_survival.py
* Use GradientPairPrecise
* Remove unused variables
* Set a default dtor for SimpleDMatrix to initialize the default copy ctor, which is
deleted due to the unique_ptr member.
* Remove commented code.
* Remove warning for calling host function (std::max).
* Remove warning for initialization order.
* Remove warning for unused variables.
Normal prediction with DMatrix is now thread-safe via locks. The newly added inplace prediction is lock-free and thread-safe.
When data is on device (cupy, cudf), the returned data is also on device.
* Implementation for numpy, csr, cudf and cupy.
* Implementation for dask.
* Remove sync in simple dmatrix.
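A short usage sketch (assuming the new method is exposed as `Booster.inplace_predict`); prediction runs directly on the ndarray without building a DMatrix, and when a cupy/cuDF input is passed the result stays on the device:
```python
import numpy as np
import xgboost as xgb

X = np.random.randn(256, 8)
y = np.random.randn(256)
bst = xgb.train({'tree_method': 'hist'}, xgb.DMatrix(X, label=y), num_boost_round=10)

# Lock-free, thread-safe path: no DMatrix construction, no global lock.
fast = bst.inplace_predict(X)

# Conventional path for comparison; this one takes a lock internally.
slow = bst.predict(xgb.DMatrix(X))
assert np.allclose(fast, slow)
```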
* [WIP] Add lower and upper bounds on the label for survival analysis
* Update test MetaInfo.SaveLoadBinary to account for extra two fields
* Don't clear qids_ for version 2 of MetaInfo
* Add SetInfo() and GetInfo() method for lower and upper bounds
* changes to aft
* Add parameter class for AFT; use enum's to represent distribution and event type
* Add AFT metric
* changes to neg grad to grad
* changes to binomial loss
* changes to overflow
* changes to eps
* changes to code refactoring
* changes to code refactoring
* changes to code refactoring
* Re-factor survival analysis
* Remove aft namespace
* Move function bodies out of AFTNormal and AFTLogistic, to reduce clutter
* Move function bodies out of AFTLoss, to reduce clutter
* Use smart pointer to store AFTDistribution and AFTLoss
* Rename AFTNoiseDistribution enum to AFTDistributionType for clarity
The enum class was not a distribution itself but a distribution type
* Add AFTDistribution::Create() method for convenience
* changes to extreme distribution
* changes to extreme distribution
* changes to extreme
* changes to extreme distribution
* changes to left censored
* deleted cout
* changes to x,mu and sd and code refactoring
* changes to print
* changes to hessian formula in censored and uncensored
* changes to variable names and pow
* changes to Logistic Pdf
* changes to parameter
* Expose lower and upper bound labels to R package
* Use example weights; normalize log likelihood metric
* changes to CHECK
* changes to logistic hessian to standard formula
* changes to logistic formula
* Comply with coding style guideline
* Revert back Rabit submodule
* Revert dmlc-core submodule
* Comply with coding style guideline (clang-tidy)
* Fix an error in AFTLoss::Gradient()
* Add missing files to amalgamation
* Address @RAMitchell's comment: minimize future change in MetaInfo interface
* Fix lint
* Fix compilation error on 32-bit target, when size_t == bst_uint
* Allocate sufficient memory to hold extra label info
* Use OpenMP to speed up
* Fix compilation on Windows
* Address reviewer's feedback
* Add unit tests for probability distributions
* Make Metric subclass of Configurable
* Address reviewer's feedback: Configure() AFT metric
* Add a dummy test for AFT metric configuration
* Complete AFT configuration test; remove debugging print
* Rename AFT parameters
* Clarify test comment
* Add a dummy test for AFT loss for uncensored case
* Fix a bug in AFT loss for uncensored labels
* Complete unit test for AFT loss metric
* Simplify unit tests for AFT metric
* Add unit test to verify aggregate output from AFT metric
* Use EXPECT_* instead of ASSERT_*, so that we run all unit tests
* Use aft_loss_param when serializing AFTObj
This is to be consistent with AFT metric
* Add unit tests for AFT Objective
* Fix OpenMP bug; clarify semantics for shared variables used in OpenMP loops
* Add comments
* Remove AFT prefix from probability distribution; put probability distribution in separate source file
* Add comments
* Define kPI and kEulerMascheroni in probability_distribution.h
* Add probability_distribution.cc to amalgamation
* Remove unnecessary diff
* Address reviewer's feedback: define variables where they're used
* Eliminate all INFs and NANs from AFT loss and gradient
* Add demo
* Add tutorial
* Fix lint
* Use 'survival:aft' to be consistent with 'survival:cox'
* Move sample data to demo/data
* Add visual demo with 1D toy data
* Add Python tests
Co-authored-by: Philip Cho <chohyu01@cs.washington.edu>
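For reference, a hedged sketch of how the new AFT interface can be used from Python (parameter and metric names as documented for `survival:aft`; treat them as assumptions here):
```python
import numpy as np
import xgboost as xgb

# Interval-censored toy data: the true label lies somewhere in [lower, upper];
# np.inf in the upper bound would mark a right-censored observation.
X = np.random.randn(64, 3)
y_lower = np.abs(np.random.randn(64)) + 1.0
y_upper = y_lower + np.random.uniform(0.5, 2.0, size=64)

dtrain = xgb.DMatrix(X)
dtrain.set_float_info('label_lower_bound', y_lower)
dtrain.set_float_info('label_upper_bound', y_upper)

params = {'objective': 'survival:aft',
          'eval_metric': 'aft-nloglik',
          'aft_loss_distribution': 'normal',
          'aft_loss_distribution_scale': 1.0,
          'tree_method': 'hist'}
bst = xgb.train(params, dtrain, num_boost_round=20, evals=[(dtrain, 'train')])
```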
* Move thread local entry into Learner.
This is an attempt to work around the CUDA context issue with static variables, where
the CUDA context can be released before the device vector.
* Add PredictionEntry to thread local entry.
This eliminates one copy of prediction vector.
* Don't define CUDA C API in a namespace.
* - create a gpu metrics (internal) registry
- the objective is to separate the cpu and gpu implementations such that they evolve
independently. to that end, this approach will:
- preserve the same metrics configuration (from the end user perspective)
- internally delegate the responsibility to the gpu metrics builder when there is a
valid device present
- decouple the gpu metrics builder from the cpu ones to prevent misuse
- move away from including the cuda file from within the cc file and segregate the code
via ifdef's
* Use pre-rounding based method to obtain reproducible floating point
summation.
* GPU Hist for regression and classification are bit-by-bit reproducible.
* Add doc.
* Switch to thrust reduce for `node_sum_gradient`.
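An illustrative sketch of the pre-rounding idea in Python (not the actual GPU kernel): each addend is quantized to a common power-of-two quantum chosen so the whole sum is exact in double precision, which makes the result independent of summation order.
```python
import numpy as np

def pre_round(values: np.ndarray) -> np.ndarray:
    """Quantize addends so that summing them is exact in float64."""
    n = values.size
    max_abs = float(np.max(np.abs(values)))
    if max_abs == 0.0:
        return values
    # Pick a power-of-two quantum so that n * max_abs / quantum <= 2**52,
    # i.e. every partial sum stays exactly representable.
    exponent = int(np.ceil(np.log2(max_abs * n)))
    quantum = 2.0 ** (exponent - 52)
    return np.round(values / quantum) * quantum

grads = np.random.randn(1_000_000)
rounded = pre_round(grads)
# Any summation order now yields bit-identical results.
assert rounded.sum() == rounded[::-1].sum() == np.sort(rounded).sum()
```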
* Add release note for 1.0.0
* Fix a small bug in the Python script that compiles the list of contributors
* Clarify governance of CI infrastructure; now PMC is formally in charge
* Address reviewer comment
* Fix typo
- move segment sorter to common
- this is the first of a handful of PRs that split the larger PR #5326
- it moves this facility to common (from the ranking objective class), so that it can be
used for metric computation
- it also wraps all the raw device pointers into Span.
* Remove f-string, since it's not supported by Python 3.5 (#5330)
* Remove f-string, since it's not supported by Python 3.5
* Add Python 3.5 to CI, to ensure compatibility
* Remove duplicated matplotlib
* Show deprecation notice for Python 3.5
* Fix lint
* Fix lint
* Fix a unit test that mistook MINOR ver for PATCH ver
* Enforce only major version in JSON model schema
* Bump version to 1.1.0-SNAPSHOT
* Added a check call macro in the jvm package, preventing execution of other functions
from the jvm when an error occurred in XGBoost. For example, when prediction fails the jvm
should not try to allocate memory based on the output prediction size.
Move this function into gbtree, and use only the updater for doing so. Now that the predictor knows exactly how many trees to predict, there's no need for it to update the prediction cache.
* Move prediction cache into Learner.
* Clean-ups
- Remove duplicated cache in Learner and GBM.
- Remove ad-hoc fix of invalid cache.
- Remove `PredictFromCache` in predictors.
- Remove prediction cache for linear altogether, as it only moves the
prediction into the training process but doesn't provide any actual overall speed
gain.
- The cache is now unique to Learner, which means the ownership is no longer
shared by any other components.
* Changes
- Add version to prediction cache.
- Use weak ptr to check expired DMatrix.
- Pass shared pointer instead of raw pointer.
The setup.py is rewritten. This new script uses only Python code and provides customized
implementations of setuptools commands. This way users can run most setuptools commands
just like with any other Python library.
* Remove setup_pip.py
* Remove soft links.
* Define customized commands.
* Remove shell script.
* Remove makefile script.
* Update the doc for building from source.
* Make pip install xgboost*.tar.gz work by fixing build-python.sh
* Simplify install doc
* Add test
* Install Miniconda for Linux target too
* Build XGBoost only once in sdist
* Try importing xgboost after installation
* Don't set PYTHONPATH env var for sdist test
* Turn xgboost::DataType into C++11 enum class
* New binary serialization format for DMatrix::MetaInfo
* Fix clang-tidy
* Fix c++ test
* Implement new format proposal
* Move helper functions to anonymous namespace; remove unneeded field
* Fix lint
* Add shape.
* Keep only roundtrip test.
* Fix test.
* various fixes
* Update data.cc
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
* Simplify Scikit-Learn parameter management.
* Copy base class for removing duplicated parameter signatures.
* Set all parameters to None.
* Handle None in set_param.
* Extract the doc.
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
* Simplify DropTrees calling logic
* Add `training` parameter for prediction method.
* [Breaking]: Add `training` to C API.
* Change for R and Python custom objective.
* Correct comment.
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
* Fix syncing DMatrix columns.
* notes for tree method.
* Enable feature validation for all interfaces except for jvm.
* Better tests for boosting from predictions.
* Disable validation on JVM.
* Disable parameter validation for now.
Scikit-Learn passes all parameters down to XGBoost, whether they are used or
not.
* Add option `validate_parameters`.
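A small usage sketch, assuming the option is spelled `validate_parameters` and is off by default so that scikit-learn-style parameter passing keeps working:
```python
import numpy as np
import xgboost as xgb

X = np.random.randn(128, 5)
y = np.random.randint(0, 2, size=128)
dtrain = xgb.DMatrix(X, label=y)

# With validation on, an unknown key (e.g. a misspelled 'max_dpth') would be
# reported instead of being silently ignored.
params = {'validate_parameters': True,
          'max_depth': 3,
          'objective': 'binary:logistic'}
bst = xgb.train(params, dtrain, num_boost_round=5)
```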
* - implementation of the map ranking algorithm
- also addressed the suggestions mentioned in the earlier ranking PRs
- made some performance improvements to the ndcg algorithm as well
* Add OpenMP as CMake target
* Require CMake 3.12, to allow linking OpenMP target to objxgboost
* Specify OpenMP compiler flag for CUDA host compiler
* Require CMake 3.16+ if the OS is Mac OSX
* Use AppleClang in Mac tests.
* Update dmlc-core
* Remove `learning_rates`.
It has been deprecated since we have callbacks.
* Set `before_iteration` of `reset_learning_rate` to False to preserve
the initial learning rate, and comply with the term "reset".
Closes #4709.
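A minimal sketch of the callback-based replacement (assuming the old-style `xgb.callback.reset_learning_rate` helper, which accepts either a list with one eta per round or a custom function):
```python
import numpy as np
import xgboost as xgb

X = np.random.randn(200, 4)
y = np.random.randn(200)
dtrain = xgb.DMatrix(X, label=y)

# A learning-rate schedule, one value per boosting round.
schedule = [0.3, 0.2, 0.1, 0.05, 0.05]
bst = xgb.train({'objective': 'reg:squarederror'}, dtrain,
                num_boost_round=len(schedule),
                callbacks=[xgb.callback.reset_learning_rate(schedule)])
```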
* Tests for various `tree_method`.
* Pass pointer to model parameters.
This PR de-duplicates most of the model parameters, except the one in
`tree_model.h`. One difficulty is that `base_score` is a model property but can be
changed at runtime by the objective function. Hence when performing model IO, we
need to save the one provided by users, instead of the one transformed by the
objective. Here we created an immutable version of `LearnerModelParam` that
represents the value of the model parameters after configuration.
This PR fixes tree weights in dart being ignored when computing contributions.
* Fix ellpack page source link.
* Add tree weights to compute contribution.
- Install wget explicitly to match openssl.
- Install CMake explicitly.
- Use newer miniconda link.
- Reenable unittests.
- gcc@9 + xcode@10 for osx due to missing <_stdio.h>. Other versions of gcc should also work, but since Homebrew pours gcc@9 by default after the update, I just stick with the latest version.
- Disabled one external memory test for OSX. Not sure about the thread implementation in there and fixing external memory is beyond the scope of this PR.
- Use Python3 with conda in jvm package.
* Extract interaction constraints from split evaluator.
The reason for doing so is mostly model IO, where num_feature and interaction_constraints are copied in the split evaluator. Also, the interaction constraint is itself a feature selector, acting like the column sampler, and it's inefficient to bury it deep in the evaluator chain. Lastly, removing another copied parameter is a win.
* Enable interaction constraints for the approx tree method.
Now that the implementation is split out from the evaluator class, it's also enabled for the approx method.
* Remove obsolete code in colmaker.
It was never documented nor actually used in the real world. Also, there isn't a single test for those code blocks.
* Unify the types used for row and column indices.
As input dataset sizes approach a billion rows, incorrect use of int is subject to overflow, and signed integer overflow is undefined behaviour. This PR starts the procedure of unifying the index types to unsigned integers. There is an optimization that can exploit this undefined behaviour, but after some testing I don't see it being beneficial to XGBoost.
This makes GPU Hist robust in a distributed environment, as some workers might not
be associated with any data in either training or evaluation.
* Disable rabit mock test for now: See #5012 .
* Disable dask-cudf test at prediction for now: See #5003
* Launch the dask job for all workers even though they might not have any data.
* Check 0 rows in elementwise evaluation metrics.
Using AUC and AUC-PR still throws an error. See #4663 for a robust fix.
* Add tests for edge cases.
* Add `LaunchKernel` wrapper handling zero sized grid.
* Move some parts of allreducer into a cu file.
* Don't validate feature names when the booster is empty.
* Sync number of columns in DMatrix.
As num_feature is required to be the same across all workers in data split
mode.
* Filtering in the dask interface now by default syncs any booster that's not
empty, instead of using rank 0.
* Fix Jenkins' GPU tests.
* Install dask-cuda from source in Jenkins' test.
Now all tests are actually running.
* Restore GPU Hist tree synchronization test.
* Check UUID of running devices.
The check is only performed on CUDA version >= 10.x, as 9.x doesn't have UUID field.
* Fix CMake policy and project variables.
Use xgboost_SOURCE_DIR uniformly, add policy for CMake >= 3.13.
* Fix copying data to CPU
* Fix race condition in cpu predictor.
* Fix duplicated DMatrix construction.
* Don't download extra nccl in CI script.
* Do not store built artifacts in the Jenkins master
* Add wheel renaming script
* Upload wheels to S3 bucket
* Use env.GIT_COMMIT
* Capture git hash correctly
* Add missing import in Jenkinsfile
* Address reviewer's comments
* Put artifacts for pull requests in separate directory
* No wildcard expansion in Windows CMD
* Use `UpdateAllowUnknown` for non-model related parameters.
Model parameters cannot pack an additional boolean value due to the binary IO
format. This commit deals only with non-model related parameter configuration.
* Add tidy command line arg for use-dmlc-gtest.
* - pairwise ranking objective implementation on gpu
- there are a couple more algorithms (ndcg and map) for which support will be added
as follow-up PRs
- with no label groups defined, get gradient is 90x faster on gpu (120m-instance
mortgage dataset)
- it can perform an order of magnitude faster with ~10 groups (and adequate cores
for the cpu implementation)
* Add JSON config to rank obj.
* Use CMake config file for representing version.
* Generate C and Python version files with CMake.
The generated file is written into the source tree, but unless XGBoost bumps
its version, there will be no actual modification. This retains compatibility
with the Makefiles for R.
* Add XGBoost version to the DMatrix binaries.
* Simplify prefetch detection in CMakeLists.txt
* Apply Configurable to objective functions.
* Apply Model to Learner and Regtree, gbm.
* Add Load/SaveConfig to objs.
* Refactor obj tests to use smart pointer.
* Dummy methods for Save/Load Model.
* Don't set_params at the end of set_state.
* Also fix another issue found in dask prediction.
* Add note about prediction.
Don't support other prediction modes at the moment.
* Move get transpose into cc.
* Clean up headers in host device vector, remove thrust dependency.
* Move span and host device vector into public.
* Install c++ headers.
* Short notes for c and c++.
Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* Add BigDenseMatrix
* ability to create a DMatrix with arrays bigger than Integer.MAX_VALUE in size
* uses sun.misc.Unsafe
* make DMatrix test work from a jar as well
* Add public group getter for java and scala
* Remove unnecessary param from javadoc
* Fix typo
* Fix another typo
* Add semicolon
* Fix javadoc return statement
* Fix missing return statement
* Add a unit test
* Restrict access to `cfg_` in gbm.
* Verify having correct updaters.
* Remove `grow_global_histmaker`
This updater is the same as `grow_histmaker`. The former is not in our
documentation, so we just remove it.
* Initial support for cudf integration.
* Add two C APIs for consuming data and metainfo.
* Add CopyFrom for SimpleCSRSource as a generic function to consume the data.
* Add FromDeviceColumnar for consuming device data.
* Add new MetaInfo::SetInfo for consuming label, weight etc.
* Refactor configuration [Part II].
* General changes:
** Remove `Init` methods to avoid ambiguity.
** Remove `Configure(std::map<>)` to avoid redundant copying and prepare for
parameter validation. (`std::vector` is returned from `InitAllowUnknown`).
** Add name to tree updaters for easier debugging.
* Learner changes:
** Make `LearnerImpl` the only source of configuration.
All configurations are stored and carried out by `LearnerImpl::Configure()`.
** Remove booster in C API.
Originally kept for "compatibility reasons", but the reason was never stated. So here
we just remove it.
** Add a `metric_names_` field in `LearnerImpl`.
** Remove `LazyInit`. Configuration will always be lazy.
** Run `Configure` before every iteration.
* Predictor changes:
** Allocate both cpu and gpu predictor.
** Remove cpu_predictor from gpu_predictor.
`GBTree` is now used to dispatch the predictor.
** Remove some GPU Predictor tests.
* IO
No IO changes. The binary model format stability is tested by comparing
hash values of saved models between two commits.
* bump scala to 2.12 which requires java 8 and also newer flink and akka
* put scala version in artifactId
* fix appveyor
* fix for scaladoc issue that looks like https://github.com/scala/bug/issues/10509
* fix ci_build
* update versions in generate_pom.py
* fix generate_pom.py
* Apache does not yet provide a Spark 2.4.3 distribution built with Scala 2.12, so for now I use a tgz I put on S3
* Upload spark-2.4.3-bin-scala2.12-hadoop2.7.tgz to our own S3
* Update Dockerfile.jvm_cross
* Update Dockerfile.jvm_cross
* Reorganize contributor's doc
* Address comments from @trivialfis
* Address @sriramch's comment: include ABI compatibility guarantee
* Address @rongou's comment
* Postpone ABI compatibility guarantee for now
* provide the readme
* update for format
* reformat
* reformat -2
* update again
* update format
* update w.r.t yinlou's comments
* Add kubernetes tutorial to Table of Contents
* Style edit
* Fix #4630, #4421: Preserve correct ordering between metrics, and always use last metric for early stopping
* Clarify semantics of early stopping in presence of multiple valid sets and metrics
* Add a test
* Fix lint
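A sketch of the clarified behavior: metrics keep their given order and the last one drives early stopping (assuming the standard `xgb.train` interface):
```python
import numpy as np
import xgboost as xgb

X = np.random.randn(300, 5)
y = np.random.randint(0, 2, size=300)
dtrain = xgb.DMatrix(X[:200], label=y[:200])
dvalid = xgb.DMatrix(X[200:], label=y[200:])

params = {'objective': 'binary:logistic',
          # Order is preserved; early stopping watches the LAST metric ('auc'),
          # evaluated on the LAST entry of evals ('valid').
          'eval_metric': ['logloss', 'auc']}
bst = xgb.train(params, dtrain, num_boost_round=100,
                evals=[(dtrain, 'train'), (dvalid, 'valid')],
                early_stopping_rounds=5)
print(bst.best_iteration, bst.best_score)
```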
* _maybe_pandas_xxx should return their arguments unchanged if no pandas installed
* Tests should not assume pandas is installed
* Mark tests which require pandas as such
* Fix external memory for getting column batches.
This fixes two bugs:
* Use PushCSC for getting column batches.
* Don't remove the created temporary directory before finishing the test.
* Check all pages.
* Add to documentation how to build native unit tests
* Add instructions to run Python tests and to use Docker container [skip ci]
* Fix link to pytest chapter
* Add link to Google Test [skip ci]
* Set PYTHONPATH [skip ci]
* Revise test_python.sh for running tests locally
* Update test_python.sh
* Place Docker recommendation notice in a prominent place [skip ci]
* Initial performance optimizations for xgboost
* remove includes
* revert float->double
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* Check existence of _mm_prefetch and __builtin_prefetch
* Fix lint
* optimizations for CPU
* applying comments in review
* add some comments, code refactoring
* fixing issues in CI
* adding runtime checks
* remove 1 extra check
* remove extra checks in BuildHist
* remove checks
* add debug info
* added debug info
* revert changes
* added comments
* Apply suggestions from code review
Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* apply review comments
* Remove unused function CreateNewNodes()
* Add descriptive comment on node_idx variable in QuantileHistMaker::Builder::BuildHistsBatch()
* Implement tree model dump with a code generator.
* Split up generators.
* Implement graphviz generator.
* Use pattern matching.
* [Breaking] Return a Source in `to_graphviz` instead of Digraph in Python package.
Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* - do not create device vectors for the entire sparse page while computing histograms...
- while creating the compressed histogram indices, the row vector is created for the entire
sparse page batch. this is needless as we only process chunks at a time based on a slice
of the total gpu memory
- this pr will allocate only as much as required to store the appropriate row indices and the entries
* - do not dereference row_ptrs once the device_vector has been created to elide host copies of those counts
- instead, grab the entry counts directly from the sparsepage
* - set the appropriate device before freeing device memory...
- pr #4532 added a global memory tracker/logger to keep track of number of (de)allocations
and peak memory usage on a per device basis.
- this pr adds the appropriate check to make sure that the (de)allocation counts and memory usages
make sense for the device, since verbosity is typically increased on debug/non-retail builds.
* - pre-create cub allocators and reuse them
- create them once and do not resize them dynamically. we need to ensure that these allocators
are created and destroyed exactly once so that the appropriate device ids are set
This is part 1 of refactoring configuration.
* Move tree heuristic configurations.
* Split up declarations and definitions for GBTree.
* Implement UseGPU in gbm.
* - training with external memory - part 2 of 2
- when external memory support is enabled, building of histogram indices is
done incrementally for every sparse page
- the entire set of input data is divided across multiple gpus and the relative
row positions within each device are tracked when building the compressed histogram buffer
- this was tested using a mortgage dataset containing ~670m rows before 4 T4s could be
saturated
* Fix C++11 config parser
* Use raw strings to improve readability of regex
* Fix compilation for GCC 5.x
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
* simplify the config.h file
* revise config.h
* revised config.h
* revise format
* revise format issues
* revise whitespace issues
* revise whitespace namespace format issues
* revise namespace format issues
* format issues
* format issues
* format issues
* format issues
* Revert submodule changes
* minor change
* Update src/common/config.h
Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* address format issue from trivialfis
* Use correct cub submodule
* - training with external memory part 1 of 2
- this pr focuses on computing the quantiles using multiple gpus on a
dataset that uses the external cache capabilities
- there will be a follow-up pr soon after this that will support creation
of histogram indices on large datasets as well
- both of these changes are required to support training with external memory
- the sparse pages in dmatrix are taken in batches and the cut matrices
are incrementally built
- also snuck in some (perf) changes related to sketch aggregation across multiple
features across multiple sparse page batches. instead of aggregating the summary
inside each device and merging later, it is aggregated in-place when the device
is working on different rows but the same feature
* Only define `gpu_id` and `n_gpus` in `LearnerTrainParam`
* Pass LearnerTrainParam through XGBoost via factory method.
* Disable all GPU usage when GPU related parameters are not specified (fixes XGBoost choosing GPU over-aggressively).
* Test learner train param io.
* Fix gpu pickling.
* - fix issues with training with external memory on cpu
- use the batch size to determine the correct number of rows in a batch
- use the right number of threads in omp parallelization if the batch size
is less than the default omp max threads (applicable for the last batch)
* - handle scenarios where last batch size is < available number of threads
- augment tests such that we can test all scenarios (batch size <, >, = number of threads)
* adding support for matrix slicing with query ID for cross-validation
* hail mary test of unrar installation for windows tests
* trying to modify tests to run in Github CI
* Remove dependency on wget and unrar
* Save error log from R test
* Relax assertion in test_training
* Use int instead of bool in C function interface
* Revise R interface
* Add XGDMatrixSliceDMatrixEx and keep old XGDMatrixSliceDMatrix for API compatibility
* Add CMake option to use bundled gtest from dmlc-core, so that it is easy to build XGBoost with gtest on Windows
* Consistently apply OpenMP flag to all targets. Force enable OpenMP when USE_CUDA is turned on.
* Insert vcomp140.dll into Windows wheels
* Add C++ and Python tests for CPU and GPU targets (CUDA 9.0, 10.0, 10.1)
* Prevent spurious msbuild failure
* Add GPU tests
* Upgrade dmlc-core
* Fix #4462: Use /MT flag consistently for MSVC target
* First attempt at Windows CI
* Distinguish stages in Linux and Windows pipelines
* Try running CMake in Windows pipeline
* Add build step
* Automatically set maximize_evaluation_metrics if not explicitly given.
* When custom_eval is set, require maximize_evaluation_metrics.
* Update documents on early stop in XGBoost4J-Spark.
* Fix code error.
* Make CMakeLists.txt compatible with CMake 3.3; require CMake 3.11 for MSVC
* Use CMake 3.12 when sanitizer is enabled
* Disable funroll-loops for MSVC
* Use cmake version in container name
* Add missing arg
* Fix egrep use in ci_build.sh
* Display CMake version
* Do not set OpenMP_CXX_LIBRARIES for MSVC
* Use cmake_minimum_required()
* Use feature interaction constraints to narrow search space for split candidates.
* fix clang-tidy broken at updater_quantile_hist.cc:535:3
* make const
* fix
* try to fix exception thrown in java_test
* fix suspected mistake which caused EvaluateSplit error
* try fix
* Fix bug: feature ID and node ID swapped in argument
* Rename CheckValidation() to CheckFeatureConstraint() for clarity
* Do not create temporary vector validFeatures, to enable parallelism
* Combine thread launches into single launch per tree for gpu_hist
algorithm.
* Address deprecation warning
* Add manual column sampler constructor
* Turn off omp dynamic to get a guaranteed number of threads
* Enable openmp in cuda code
* All Linux tests are now in Jenkins CI
* Tests are now de-coupled from builds. We can now build XGBoost with one version of CUDA/JDK and test it with another version of CUDA/JDK
* Builds (compilation) are significantly faster because 1) They use C5 instances with faster CPU cores; and 2) build environment setup is cached using Docker containers
* fix the nan and non-zero missing value handling
* fix nan handling part
* add missing value
* Update MissingValueHandlingSuite.scala
* Update MissingValueHandlingSuite.scala
* stylistic fix
* [jvm-packages][hot-fix] fix column mismatch caused by zip actions at XGBoostModel.transformInternal
* apply minibatch in prediction
* an iterator-compatible minibatch prediction
* regressor impl
* continuous working on mini-batch prediction of xgboost4j-spark
* Update Booster.java
* Refactor CMake scripts.
* Remove CMake CUDA wrapper.
* Bump CMake version for CUDA.
* Use CMake to handle Doxygen.
* Split up CMakeList.
* Export install target.
* Use modern CMake.
* Remove build.sh
* Workaround for gpu_hist test.
* Use cmake 3.12.
* Revert machine.conf.
* Move CLI test to gpu.
* Small cleanup.
* Support using XGBoost as submodule.
* Fix windows
* Fix cpp tests on Windows
* Remove duplicated find_package.
* [r-package] cut CI-time dependency on craigcitro/r-travis (fixes #4348)
* Install R
* Install R on OSX
* Remove gfortran symlink
* Specify CRAN repo
* added more R dependencies needed for testing
* removed heavy R dependencies in CI
* fixed bug in env var, removed unnecessary apt installs of R
* fix to R installs
The old NativeLibLoader had a short-circuit load path which modified
java.library.path and attempted to load the xgboost library from outside
the jar first, falling back to loading the library from inside the jar.
This path is a no-op when using XGBoost outside of its
source tree. Additionally it triggers an illegal reflective access
warning in the module system in 9, 10, and 11.
On Java 12 the ClassLoader fields are not accessible via reflection
(separately from the illegal reflective access warning), and so it fails
in a way that isn't caught by the code which falls back to loading the
library from inside the jar.
This commit removes that code path and always loads the xgboost library
from inside the jar file as it's a valid technique across multiple JVM
implementations and works with all versions of Java.
* Fix Histogram allocation.
nidx_map is cleared after `Reset`, but the histogram data size isn't changed, hence
histogram recycling is used in later iterations. After a reset (building a new
tree), newly allocated nodes start from 0, while recycling always chooses
the node with the smallest index, which happens to be our newly allocated node 0.
* When building pull requests, use Docker cache for master branch
Docker build caches are per-branch, so new pull requests will initially
have no build cache, causing the Docker containers to be built from
scratch. New pull requests should use the cache associated with the
master branch. This makes sense, since most pull requests do not modify
the Dockerfile.
* Add comments
* make the assignments of HostDeviceVector exception-safe.
* storing a dummy GPUDistribution instance in HDV for CPU-based code.
* change testxgboost binary location to the build directory.
* Make train in xgboost4j respect print params
Previously no setting in params argument of Booster::train would prevent
the Rabit.trackerPrint call. This can fill up a lot of screen space in
the case that many folds are being trained.
* Setting "silent" in this map to "true", "True", a non-zero integer, or
a string that can be parsed to such an int will prevent printing.
* Setting "verbose_eval" to "False" or "false" will prevent printing.
* Setting "verbose_eval" to an int (or a String parseable to an int) n
will result in printing every n steps, or no printing is n is zero.
This is to match the python behaviour described here:
https://www.kaggle.com/c/rossmann-store-sales/discussion/17499
* Fixed 'slient' typo in xgboost4j test
* private access on two methods
* Optimisations for gpu_hist.
* Use streams to overlap operations.
* ColumnSampler now uses HostDeviceVector to prevent repeatedly copying feature vectors to the device.
* Brought the silent parameter for the SKLearn-like API back, marked it deprecated.
- added deprecation notice and warning
- removed silent from the tests for the SKLearn-like API
* Improved multi-node multi-GPU random forests.
- removed rabit::Broadcast() from each invocation of column sampling
- instead, syncing the PRNG seed when a ColumnSampler() object is constructed
- this makes non-trivial column sampling significantly faster in the distributed case
- refactored distributed GPU tests
- added distributed random forests tests
* Upgrade gtest for clang-tidy.
* Use CMake to install GTest instead of mv.
* Don't enforce clang-tidy to return 0 due to errors in thrust.
* Add a small test for tidy itself.
* Reformat.
* Added SKLearn-like random forest Python API.
- added XGBRFClassifier and XGBRFRegressor classes to SKL-like xgboost API
- also added n_gpus and gpu_id parameters to SKL classes
- added documentation describing how to use xgboost for random forests,
as well as existing caveats
* Initial commit to support multi-node multi-gpu xgboost using dask
* Fixed NCCL initialization by not ignoring the opg parameter.
- it now crashes on NCCL initialization, but at least we're attempting it properly
* At the root node, perform a rabit::Allreduce to get initial sum_gradient across workers
* Synchronizing in a couple of more places.
- now the workers don't go down, but just hang
- no more "wild" values of gradients
- probably needs syncing in more places
* Added another missing max-allreduce operation inside BuildHistLeftRight
* Removed unnecessary collective operations.
* Simplified rabit::Allreduce() sync of gradient sums.
* Removed unnecessary rabit syncs around ncclAllReduce.
- this improves performance _significantly_ (7x faster for overall training,
20x faster for xgboost proper)
* pulling in latest xgboost
* removing changes to updater_quantile_hist.cc
* changing use_nccl_opg initialization, removing unnecessary if statements
* added definition for opaque ncclUniqueId struct to properly encapsulate GetUniqueId
* placing struct definition in guard to avoid duplicate code errors
* addressing linting errors
* removing
* removing additional arguments to AllReduer initialization
* removing distributed flag
* making comm init symmetric
* removing distributed flag
* changing ncclCommInit to support multiple modalities
* fix indenting
* updating ncclCommInitRank block with necessary group calls
* fix indenting
* adding print statement, and updating accessor in vector
* improving print statement to end-line
* generalizing nccl_rank construction using rabit
* assume device_ordinals is the same for every node
* test, assume device_ordinals is identical for all nodes
* test, assume device_ordinals is unique for all nodes
* changing names of offset variable to be more descriptive, editing indenting
* wrapping ncclUniqueId GetUniqueId() and aesthetic changes
* adding synchronization, and tests for distributed
* adding to tests
* fixing broken #endif
* fixing initialization of gpu histograms, correcting errors in tests
* adding to contributors list
* adding distributed tests to jenkins
* fixing bad path in distributed test
* debugging
* adding kubernetes for distributed tests
* adding proper import for OrderedDict
* adding urllib3==1.22 to address ordered_dict import error
* added sleep to allow workers to save their models for comparison
* adding name to GPU contributors under docs
* Fix early stop with xgboost4j-spark
* Update XGBoost.java
* Update XGBoost.java
* Update XGBoost.java
To use -Float.MAX_VALUE as the lower bound, in case the metric is positive.
* Only update best score if the current score is better (no update when equal)
* Update xgboost-spark tutorial to fix early stopping docs.
* Fix test_gpu_coordinate.
* Use `gpu_coord_descent` in test.
* Reduce number of running rounds.
* Remove nthread.
* Use githubusercontent for r-appveyor.
* Use githubusercontent in travis r tests.
* Prevent empty quantiles
* Revise and improve unit tests for quantile hist
* Remove unnecessary comment
* Add #2943 as a test case
* Skip test if no sklearn
* Revise misleading comments
* Add checks for group size.
* Simple docs.
* Search group index during hist cut matrix initialization.
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* Fix broken R test: Install Homebrew GCC
Missing GCC Fortran causes installation failure of a dependency package
(igraph)
* Register gfortran system-wide
* Use correct keg
* Set env vars to change compiler choice
* Do not break other Mac builds
* Nuclear option: symlink gfortran
* Use /usr/local/bin instead of /usr/bin
* Symlink library path too
* Update run_test.sh
* Remove GHistRow, GHistEntry, GHistIndexRow.
* Remove kSimpleStats.
* Remove CheckInfo, SetLeafVec in GradStats and in SKStats.
* Clean up the GradStats.
* Cleanup calcgain.
* Move LossChangeMissing out of common.
* Remove [] operator from GHistIndexBlock.
* Basic script for using compilation database.
* Add `GENERATE_COMPILATION_DATABASE` to CMake.
* Rearrange CMakeLists.txt.
* Add basic python clang-tidy script.
* Remove modernize-use-auto.
* Add clang-tidy to Jenkins
* Refine logic for correct path detection
In Jenkins, the project root is of form /home/ubuntu/workspace/xgboost_PR-XXXX
* Run clang-tidy in CUDA 9.2 container
* Use clang_tidy container
* Enable xgb_model parameter in XGClassifier scikit-learn API
https://github.com/dmlc/xgboost/issues/3049
* add test_XGBClassifier_resume():
test for xgb_model parameter in XGBClassifier API.
* Update test_with_sklearn.py
* Fix lint
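A short sketch of the resulting usage (relying on scikit-learn only for a toy dataset; continued training via the `xgb_model` argument of `fit`):
```python
from sklearn.datasets import load_breast_cancer
import xgboost as xgb

X, y = load_breast_cancer(return_X_y=True)

# First stage: a small model.
clf = xgb.XGBClassifier(n_estimators=5)
clf.fit(X, y)

# Second stage: resume boosting from the previous booster instead of
# starting from scratch.
clf_more = xgb.XGBClassifier(n_estimators=5)
clf_more.fit(X, y, xgb_model=clf.get_booster())
```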
* Fix failing Travis CI on Mac
Use Homebrew Addon + latest Mac image
* Use long command for pytest
* Downgrade OSX image to xcode9.3, to use Java 8
* Install pytest in Python 2 environment
* Remove clang-tidy from Travis
- ./testxgboost (without filters) failed if run on a multi-GPU machine because
the memory was allocated on the current device, but device 0
was always passed into LaunchN
* Initial performance optimizations for xgboost
* remove includes
* revert float->double
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* Check existence of _mm_prefetch and __builtin_prefetch
* Fix lint
* Updates to Booster to support other feature importances
* Add returns for Java methods
* Pass Scala style checks
* Pass Java style checks
* Fix indents
* Use class instead of enum
* Return map string double
* The build is no longer broken, thanks to a local mvn package build
* Add a unit test to increase code coverage back
* Address code review on main code
* Add more unit tests for different feature importance scores
* Address more CR
* Use Span in gpu coordinate.
* Use Span in device code.
* Fix shard size calculation.
- Use lower_bound instead of upper_bound.
* Check empty devices.
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* wrap iterators
* enable copartition training and validationset
* add parameters
* converge code path and have init unit test
* enable multi evals for ranking
* unit test and doc
* update example
* fix early stopping
* address the offline comments
* update doc
* test eval metrics
* fix compilation issue
* fix example
* Unify logging facilities.
* Enhance `ConsoleLogger` to handle different verbosity.
* Override macros from `dmlc`.
* Don't use specialized gamma when building with GPU.
* Remove verbosity cache in monitor.
* Test monitor.
* Deprecate `silent`.
* Fix doc and messages.
* Fix python test.
* Fix silent tests.
* Ensure lists cannot be passed into DMatrix
The documentation does not include lists as an allowed type for the data passed into DMatrix. Despite this, a list can be passed in without an error. This change prevents a list from being passed in directly.
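A tiny illustration of the intended usage after this change: convert lists to an ndarray (or another documented type) before constructing the DMatrix.
```python
import numpy as np
import xgboost as xgb

data = [[1.0, 2.0], [3.0, 4.0]]   # a plain Python list is now rejected
dtrain = xgb.DMatrix(np.asarray(data), label=np.array([0.0, 1.0]))
```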
* update description of early stopping rounds
The description of early stopping rounds was quite inconsistent in the scikit-learn API section: the fit paragraph says that when early stopping occurs, the last iteration is returned, not the best one, while the predict paragraph says that when predict is called without ntree_limit specified, ntree_limit equals best_ntree_limit.
Thus, when reading the fit part, one could think it is necessary to specify the best iteration when calling predict, but when reading the predict part, the best iteration is used by default and it is the last iteration that has to be specified if needed.
* Update sklearn.py
* Update sklearn.py
fix doc according to the python_lightweight_test error
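To make the now-consistent behavior concrete, a hedged sketch with the scikit-learn wrapper (attribute names such as `best_ntree_limit` as documented at the time):
```python
import numpy as np
import xgboost as xgb

X = np.random.randn(400, 5)
y = np.random.randint(0, 2, size=400)
X_tr, y_tr, X_va, y_va = X[:300], y[:300], X[300:], y[300:]

clf = xgb.XGBClassifier(n_estimators=200)
clf.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], early_stopping_rounds=10)

# fit() keeps every round up to the stopping point, but predict() without an
# explicit ntree_limit already uses the best iteration.
p_default = clf.predict(X_va)
p_best = clf.predict(X_va, ntree_limit=clf.best_ntree_limit)
assert (p_default == p_best).all()
```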
* Port elementwise metrics to GPU.
* All elementwise metrics are converted to static polymorphic.
* Create a reducer for metrics reduction.
* Remove const of Metric::Eval to accommodate CubMemory.
- Improved GPU performance logging
- Only use one execute shards function
- Revert performance regression on multi-GPU
- Use threads to launch NCCL AllReduce
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* update version
* 0.82
* fix early stopping condition
* remove unused
* update comments
* update comments
* update test
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* update version
* 0.82
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* wrap iterators
* remove unused code
* refactor
* fix typo
* use gain for sklearn feature_importances_
`gain` is a better feature importance criterion than the currently used `weight`
* added importance_type to class
* fixed test
* white space
* fix variable name
* fix deprecation warning
* fix exp array
* white spaces
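A brief usage sketch of the new `importance_type` option on the scikit-learn wrapper:
```python
import numpy as np
import xgboost as xgb

X = np.random.randn(200, 6)
y = np.random.randint(0, 2, size=200)

# 'gain' attributes importance to the loss reduction a feature brings,
# whereas the old default 'weight' only counts how often it is split on.
clf = xgb.XGBClassifier(importance_type='gain')
clf.fit(X, y)
print(clf.feature_importances_)
```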
* Enable running objectives with 0 GPU.
* Enable 0 GPU for objectives.
* Add doc for GPU objectives.
* Fix some objectives defaulted to running on all GPUs.
* Make C++ unit tests run and pass on Windows
* Fix logic for external memory. The letter ':' is part of the drive letter,
so remove the drive letter before splitting on ':'.
* Cosmetic syntax changes to keep MSVC happy.
* Fix lint
* Add Windows guard
* Fix #3342 and h2oai/h2o4gpu#625: Save predictor parameters in model file
This allows pickled models to retain predictor attributes, such as
'predictor' (whether to use CPU or GPU) and 'n_gpu' (number of GPUs
to use). Related: h2oai/h2o4gpu#625. Closes #3342.
TODO: Write a test.
* Fix lint
* Do not load GPU predictor into CPU-only XGBoost
* Add a test for pickling GPU predictors
* Make sample data big enough to pass multi GPU test
* Update test_gpu_predictor.cu
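A hedged sketch of the scenario this protects (parameter names from that era, e.g. `predictor`/`gpu_predictor`; requires a CUDA-enabled build):
```python
import pickle
import numpy as np
import xgboost as xgb

X = np.random.randn(128, 5)
y = np.random.randint(0, 2, size=128)
dtrain = xgb.DMatrix(X, label=y)

bst = xgb.train({'tree_method': 'gpu_hist', 'predictor': 'gpu_predictor'},
                dtrain, num_boost_round=10)

# Predictor settings now survive the pickle round trip, and a CPU-only build
# loading this blob falls back to the CPU predictor instead of failing.
blob = pickle.dumps(bst)
restored = pickle.loads(blob)
restored.predict(dtrain)
```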
* Fix #3747: Add coef_ and intercept_ as properties of sklearn wrapper
Scikit-learn expects linear learners to expose `coef_` and `intercept_`
as properties.
Closes #3747.
* Fix lint
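A short usage sketch (only meaningful for linear boosters such as `gblinear`):
```python
import numpy as np
import xgboost as xgb

X = np.random.randn(200, 4)
y = np.random.randn(200)

reg = xgb.XGBRegressor(booster='gblinear')
reg.fit(X, y)

print(reg.coef_)        # one weight per feature
print(reg.intercept_)   # bias term
```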
* Clean up logic for converting tree_method to updater sequence
* Use C++11 enum class for extra safety
Compiler will give warnings if switch statements don't handle all
possible values of C++11 enum class.
Also allow enum class to be used as DMLC parameter.
* Fix compiler error + lint
* Address reviewer comment
* Better docstring for DECLARE_FIELD_ENUM_CLASS
* Fix lint
* Add C++ test to see if tree_method is recognized
* Fix clang-tidy error
* Add test_learner.h to R package
* Update comments
* Fix lint error
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* documenting tracker
* Make it a separate note
The `save_model()` and `load_model()` methods only save the part of the model
that's common to all language interfaces and do not preserve Python-specific
attributes, such as `feature_names`. More crucially, the label encoder is not
preserved either; this is needed for the scikit-learn wrapper, since you may
have string labels.
Fix: Explicitly recommend pickling as the way to save scikit-learn model
objects.
* Multi-GPU support in GPUPredictor.
- GPUPredictor is multi-GPU
- removed DeviceMatrix, as it has been made obsolete by using HostDeviceVector in DMatrix
* Replaced pointers with spans in GPUPredictor.
* Added a multi-GPU predictor test.
* Fix multi-gpu test.
* Fix n_rows < n_gpus.
* Reinitialize shards when GPUSet is changed.
* Tests range of data.
* Remove commented code.
* Remove commented code.
* Enable auto-locking of issues closed long ago
Issues that were closed more than 90 days ago will be locked automatically so
that no additional comments are allowed. We will use a bot to do
this: https://probot.github.io/apps/lock/
Background: As a maintainer, I often see people leaving comments on old issue
posts that were closed long ago. Those comments are hard to discover and assist
with, since they get buried under the list of other active issues.
With this change, users who want to follow up on an old issue will be asked
to file a new issue.
* Exempt `feature-request` from auto locking
* Disable comment to avoid triggering notification
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* temp
* add method for classifier and regressor
* update tutorial
* address the comments
* update
A privilege escalation vulnerability (CVE-2017-15288) has been
identified in the Scala compilation daemon. See
https://nvd.nist.gov/vuln/detail/CVE-2017-15288
Fix: Upgrade Scala to 2.11.12.
**Symptom** Apple Clang's implementation of `std::shuffle` doesn't work
correctly when it is run with the random bit generator used for the R package:
```cpp
CustomGlobalRandomEngine::result_type
CustomGlobalRandomEngine::operator()() {
return static_cast<result_type>(
std::floor(unif_rand() * CustomGlobalRandomEngine::max()));
}
```
Minimal reproduction of failure (compile using Apple Clang 10.0):
```cpp
std::vector<int> feature_set(100);
std::iota(feature_set.begin(), feature_set.end(), 0);
// initialize with 0, 1, 2, 3, ..., 99
std::shuffle(feature_set.begin(), feature_set.end(), common::GlobalRandom());
// This returns 0, 1, 2, ..., 99, so content didn't get shuffled at all!!!
```
Note that this bug is platform-dependent; it does not appear when GCC or
upstream LLVM Clang is used.
**Diagnosis** Apple Clang's `std::shuffle` expects 32-bit integer
inputs, whereas `CustomGlobalRandomEngine::operator()` produces 64-bit
integers.
**Fix** Have `CustomGlobalRandomEngine::operator()` produce 32-bit integers.
Closes #3523.
* Split building histogram into separated class.
* Extract `InitCompressedRow` definition.
* Basic tests for gpu-hist.
* Document the code more verbosely.
* Removed `HistCutUnit`.
* Removed some duplicated copies in `GPUHistMaker`.
* Implement LCG and use it in tests.
* Added some instructions on using MinGW-built XGBoost with python.
* Changes according to the discussion and some additions
* Fixed wording and removed redundancy.
* Even more fixes
* Fixed links. Removed redundancy.
* Some fixes according to the discussion
* fixes
* Some fixes
* fixes
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* sparkJobThread
* update
* fix issue when spark job execution thread cannot return before we execute first()
* Implement Transform class.
* Add tests for softmax.
* Use Transform in regression, softmax and hinge objectives, except for Cox.
* Mark old gpu objective functions deprecated.
* static_assert for softmax.
* Split up multi-gpu tests.
* DMatrix refactor 2
* Remove buffered rowset usage where possible
* Transition to c++11 style iterators for row access
* Transition column iterators to C++ 11
* Add multi-GPU unit test environment
* Better assertion message
* Temporarily disable failing test
* Distinguish between multi-GPU and single-GPU CPP tests
* Consolidate Python tests. Use attributes to distinguish multi-GPU Python tests from single-CPU counterparts
* Fix #3730: scikit-learn 0.20 compatibility fix
sklearn.cross_validation has been removed from scikit-learn 0.20,
so replace it with sklearn.model_selection
* Display test names for Python tests for clarity
* add back train method but mark as deprecated
* fix scalastyle error
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* instrumentation
* use log console
* better measurement
* fix errors in example
* update histmaker
* add a demo of multi-class classification R version
* add a demo of multi-class classification result
* add intro to the demo readme
* Delete train.md
* Update README.md
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* remove copy paste error
* added test, commented out right now
* reinstated test
* added fix for checking encryption settings
* fix by using RDD conf
* fix compilation
* renamed conf
* use SparkSession if available
* fix message
* nop
* code review fixes
* Fix #3397: early_stop callback does not maximize metric of form NDCG@n-
The early stopping callback splits on the '-' character, which interferes
with metrics of the form NDCG@n-. As a result, XGBoost tries to minimize
NDCG@n-, where it should be maximized instead.
Fix: Specify maxsplit=1.
* Python 2.x compatibility fix
* Add scikit-learn tests
The goal is to pass scikit-learn's check_estimator() for XGBClassifier,
XGBRegressor, and XGBRanker. It is actually not possible to do so
entirely, since check_estimator() assumes that NaN is disallowed,
but XGBoost allows for NaN as missing values. However, it is always a
good idea to add some checks inspired by check_estimator().
* Fix lint
* Fix lint
* add interaction constraints
* enable both interaction and monotonic constraints at the same time
* fix lint
* add R test, fix lint, update demo
* Use dmlc::JSONReader to express interaction constraints as nested lists; Use sparse arrays for bookkeeping
* Add Python test for interaction constraints
* make R interaction constraints parameter based on feature index instead of column names, fix R coding style
* Fix lint
* Add BlueTea88 to CONTRIBUTORS.md
* Short circuit when no constraint is specified; address review comments
* Add tutorial for feature interaction constraints
* allow interaction constraints to be passed as string, remove redundant column_names argument
* Fix typo
* Address review comments
* Add comments to Python test
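A minimal usage sketch of the new parameter (constraints given as a string of nested feature-index lists, per the tutorial added here):
```python
import numpy as np
import xgboost as xgb

X = np.random.randn(200, 4)
y = np.random.randn(200)
dtrain = xgb.DMatrix(X, label=y)

# Features 0 and 1 may interact with each other, and 2 and 3 with each other,
# but no split path in a tree may mix the two groups.
params = {'tree_method': 'hist',
          'interaction_constraints': '[[0, 1], [2, 3]]'}
bst = xgb.train(params, dtrain, num_boost_round=10)
```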
- previously, vec_ in DeviceShard wasn't updated on copy; as a result,
the shards continued to refer to the old HostDeviceVectorImpl object,
which resulted in a dangling pointer once that object was deallocated
* Fix #3648: XGBClassifier.predict() should return margin scores when output_margin=True
* Fix tests to reflect correct implementation of XGBClassifier.predict(output_margin=True)
* Fix flaky test test_with_sklearn.test_sklearn_api_gblinear
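For clarity, a small sketch of the corrected behavior:
```python
import numpy as np
import xgboost as xgb

X = np.random.randn(150, 5)
y = np.random.randint(0, 2, size=150)

clf = xgb.XGBClassifier(n_estimators=10).fit(X, y)

labels = clf.predict(X)                        # class labels, as before
margins = clf.predict(X, output_margin=True)   # now raw untransformed scores
```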
* Replaced std::vector with HostDeviceVector in MetaInfo and SparsePage.
- added distributions to HostDeviceVector
- using HostDeviceVector for labels, weights and base margins in MetaInfo
- using HostDeviceVector for offset and data in SparsePage
- other necessary refactoring
* Added const version of HostDeviceVector API calls.
- const versions added to calls that can trigger data transfers, e.g. DevicePointer()
- updated the code that uses HostDeviceVector
- objective functions now accept const HostDeviceVector<bst_float>& for predictions
* Updated src/linear/updater_gpu_coordinate.cu.
* Added read-only state for HostDeviceVector sync.
- this means no copies are performed if both host and devices access
the HostDeviceVector read-only
* Fixed linter and test errors.
- updated the lz4 plugin
- added ConstDeviceSpan to HostDeviceVector
- using device % dh::NVisibleDevices() for the physical device number,
e.g. in calls to cudaSetDevice()
* Fixed explicit template instantiation errors for HostDeviceVector.
- replaced HostDeviceVector<unsigned int> with HostDeviceVector<int>
* Fixed HostDeviceVector tests that require multiple GPUs.
- added a mock set device handler; when set, it is called instead of cudaSetDevice()
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* interrupted exception is not rethrown
* Add XGBRanker to Python API doc
* Show inherited members of XGBRegressor in API doc, since XGBRegressor uses default methods from XGBModel
* Add table of contents to Python API doc
* Skip JVM doc download if not available
* Show inherited members for XGBRegressor and XGBRanker
* Expose XGBRanker to Python XGBoost module directory
* Add docstring to XGBRegressor.predict() and XGBRanker.predict()
* Fix rendering errors in Python docstrings
* Fix lint
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* fix update checkpoint func
* added xgbranker
* fixed predict method and ranking test
* reformatted code in accordance with pep8
* fixed lint error
* fixed docstring and added checks on objective
* added ranking demo for python
* fixed suffix in rank.py
* Add basic Span class based on ISO++20.
* Use Span<Entry const> instead of Inst in SparsePage.
* Add DeviceSpan in HostDeviceVector, use it in regression obj.
This pull request amends the broken #3062 to allow Spark 2.2 to work.
Please note this won't work in Spark <=2.1 as sc.removeSparkListener was implemented in Spark 2.2. (So perhaps a more general method is better, although that is what was attempted in #3062.)
This PR fixes: #3208, #3151 and the discussion in #1927.
I do find it strange that #3062 does not work in Spark 2.2; it's probably due to some sort of public/private issue in the org.apache.spark.scheduler.LiveListenerBus class inheritance (in Spark itself). The error is: `java.lang.NoSuchMethodError: org.apache.spark.scheduler.LiveListenerBus.removeListener(Ljava/lang/Object;)V`
* Adding Java/Scala doc build to Jenkins CI
* Deploy built doc to S3 bucket
* Build doc only for branches
* Build doc first, to get doc faster for branch updates
* Have ReadTheDocs download doc tarball from S3
* Update JVM doc links
* Put doc build commands in a script
* Specify Spark 2.3+ requirement for XGBoost4J-Spark
* Build GPU wheel without NCCL, to reduce binary size
* Revert "Fix #3485, #3540: Don't use dropout for predicting test sets (#3556)"
This reverts commit 44811f2330.
* Document behavior of predict() for DART booster
* Add notice to parameter.rst
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* partial finish
* no test
* add test cases
* add test cases
* address comments
* add test for regressor
* fix typo
* Fix #3545: XGDMatrixCreateFromCSCEx silently discards empty trailing rows
Description: The bug is triggered when
1. The data matrix has empty rows at the bottom. More precisely, the rows
`n-k+1`, `n-k+2`, ..., `n` of the matrix have missing values in all
dimensions (`n` number of instances, `k` number of trailing rows)
2. The data matrix is given in Compressed Sparse Column (CSC) format.
Diagnosis: When the CSC matrix is converted to Compressed Sparse Row (CSR)
format (this is the common format used for DMatrix), the trailing empty rows
are silently ignored. More specifically, the row pointer (`offset`) of the
newly created CSR matrix does not take these rows into account.
Fix: Modify the row pointer.
* Add regression test
The base margin will need to have length `[num_class] * [number of data points]`.
Otherwise, the array holding prediction results will be only partially
initialized, causing undefined behavior.
Fix: check the length of the base margin. If the length is not correct,
use the global bias (`base_score`) instead. Warn the user about the
substitution.
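As a minimal sketch (the data, shapes, and parameter values here are made up for illustration), a correctly sized base margin for a multi-class model could be supplied like this:
```python
import numpy as np
import xgboost as xgb

num_data, num_class = 100, 4
X = np.random.rand(num_data, 10)
y = np.random.randint(num_class, size=num_data)

dtrain = xgb.DMatrix(X, label=y)
# One margin value per (data point, class) pair, flattened:
dtrain.set_base_margin(np.zeros(num_data * num_class))

bst = xgb.train({'objective': 'multi:softprob', 'num_class': num_class},
                dtrain, num_boost_round=5)
```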
* Fix #3402: wrong fid crashes distributed algorithm
The bug was introduced by the recent DMatrix refactor (#3301). It was partially
fixed by #3408 but the example in #3402 was still failing. The example in #3402
will succeed after this fix is applied.
* Explicitly specify "this" to prevent compile error
* Add regression test
* Add distributed test to Travis matrix
* Install kubernetes Python package as dependency of dmlc tracker
* Add Python dependencies
* Add compile step
* Reduce size of regression test case
* Further reduce size of test
* add back train method but mark as deprecated
* fix scalastyle error
* add new
* update doc
* finish Gang Scheduling
* more
* intro
* Add sections: Prediction, Model persistence and ML pipeline.
* Add XGBoost4j-Spark MLlib pipeline example
* partial finished version
* finish the doc
* adjust code
* fix the doc
* use rst
* Convert XGBoost4J-Spark tutorial to reST
* Bring XGBoost4J up to date
* add note about using hdfs
* remove duplicate file
* fix descriptions
* update doc
* Wrap HDFS/S3 export support as a note
* update
* wrap indexing_mode example in code block
This brings many goodies, including:
* Ability to specify delimiter and weight_column for CSV files:
```python
dtrain = xgboost.DMatrix('train.csv?format=csv&label_column=0&weight_column=1&delimiter= ')
```
* Ability to choose between 0-based and 1-based indexing for LIBSVM/LIBFM files:
```python
dtrain = xgboost.DMatrix('train.libsvm?indexing_mode=1') # use 1-based indexing
dtest = xgboost.DMatrix('test.libsvm') # use 0-based indexing (default)
dtest2 = xgboost.DMatrix('test2.libsvm?indexing_mode=-1') # use heuristic to detect 0-based / 1-based
```
* Fix a bug in float parsing (issue dmlc/dmlc-core#440)
* add back train method but mark as deprecated
* fix scalastyle error
* consider spark.task.cpus when controlling parallelism
* fix bug
* fix conf setup
* calculate requestedCores within ParallelismController
* enforce spark.task.cpus = 1
* unify unit test case framework
* enable spark ui
* add back train method but mark as deprecated
* fix scalastyle error
* consider missing value in prediction
* handle single prediction instance
* fix type conversion
* Fix bug of using list(x) function when x is string
list('abcdcba') = ['a', 'b', 'c', 'd', 'c', 'b', 'a']
* Allow feature_names/feature_types to be of any type
If feature_names/feature_types is an iterable other than a string (e.g. tuple or list), convert the value to a list; otherwise construct a list with the single value
* Delete excess whitespace
* Fix whitespace to pass lint
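A small hypothetical example of the resulting behaviour (the names are made up):
```python
import numpy as np
import xgboost as xgb

X = np.random.rand(5, 3)
# Any non-string iterable (here a tuple) is converted to a list of names;
# a plain string would be wrapped as a single name, not split into characters.
dtrain = xgb.DMatrix(X, feature_names=('height', 'weight', 'age'))
print(dtrain.feature_names)  # ['height', 'weight', 'age']
```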
* Expand histogram memory dynamically to prevent large allocations for large tree depths (e.g. > 15)
* Remove GPU memory allocation messages. These are misleading as a large number of allocations are now dynamic.
* Fix appveyor R test
* Save max_delta_step as an extra attribute of Booster
Fixes #3509 and #3026, where the `max_delta_step` parameter gets lost during serialization.
* fix lint
* Use camel case for global constant
* disable local variable case in clang-tidy
* Added finding quantiles on GPU.
- this includes datasets where weights are assigned to data rows
- as the quantiles found by the new algorithm are not the same
as those found by the old one, test thresholds in
tests/python-gpu/test_gpu_updaters.py have been adjusted.
* Adjustments and improved testing for finding quantiles on the GPU.
- added C++ tests for the DeviceSketch() function
- reduced one of the thresholds in test_gpu_updaters.py
- adjusted the cuts found by the find_cuts_k kernel
Add `'total_gain'` and `'total_cover'` as possible `importance_type`
arguments to `Booster.get_score` in the Python package.
`get_score` already accepts a `'gain'` argument, which returns each
feature's average gain over all of its splits. `'total_gain'` does the
same, but returns a total rather than an average. This seems more
intuitively meaningful, and also matches the behavior of the R package's
`xgb.importance` function.
I also added an analogous `'total_cover'` option for consistency.
This should resolve #3484.
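A rough usage sketch (toy data, made up for illustration):
```python
import numpy as np
import xgboost as xgb

X, y = np.random.rand(100, 5), np.random.randint(2, size=100)
bst = xgb.train({'objective': 'binary:logistic'},
                xgb.DMatrix(X, label=y), num_boost_round=10)

print(bst.get_score(importance_type='gain'))         # average gain per split
print(bst.get_score(importance_type='total_gain'))   # gain summed over all splits
print(bst.get_score(importance_type='total_cover'))  # cover summed over all splits
```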
* Improved library loading a bit
* Fixed indentation.
* Fixes according to the discussion
* Moved the comment to a separate line.
* specified exception type
* Change doc build to reST exclusively
* Rewrite Intro doc in reST; create toctree
* Update parameter and contribute
* Convert tutorials to reST
* Convert Python tutorials to reST
* Convert CLI and Julia docs to reST
* Enable markdown for R vignettes
* Done migrating to reST
* Add guzzle_sphinx_theme to requirements
* Add breathe to requirements
* Fix search bar
* Add link to user forum
* Fail GPU CI after test failure
* Fix GPU linear tests
* Reduced number of GPU tests to speed up CI
* Remove static allocations of device memory
* Resolve illegal memory access for updater_fast_hist.cc
* Fix broken r tests dependency
* Update python install documentation for GPU
* Upgrading to NCCL2
* Part II of the NCCL2 upgrade
- Doc updates to build with nccl2
- Dockerfile.gpu update for a correct CI build with nccl2
- Updated FindNccl package to have env-var NCCL_ROOT to take precedence
* Upgrading to v9.2 for CI workflow, since it has the nccl2 binaries available
* Added NCCL2 license + copy the nccl binaries into /usr location for the FindNccl module to find
* Set LD_LIBRARY_PATH variable to pick nccl2 binary at runtime
* Need the nccl2 library download instructions inside Dockerfile.release as well
* Use NCCL2 as a static library
* add back train method but mark as deprecated
* fix scalastyle error
* disable booster setup in spark
* check in parameter conversion
* fix compilation issue
* update exception type
* add qid for https://github.com/dmlc/xgboost/issues/2748
* change names
* change spaces
* change qid to bst_uint type
* change qid type to size_t
* change qid first to SIZE_MAX
* change qid type from size_t to uint64_t
* update dmlc-core
* fix qids name error
* fix group_ptr_ error
* Style fix
* Add qid handling logic to SparsePage
* New MetaInfo format + backward compatibility fix
Old MetaInfo format (1.0) doesn't contain qid field. We still want to be able
to read from MetaInfo files saved in old format. Also, define a new format
(2.0) that contains the qid field. This way, we can distinguish files that
contain qid and those that do not.
* Update MetaInfo test
* Simplify group assignment logic
* Explicitly set qid=nullptr in NativeDataIter
NativeDataIter's callback does not support qid field. Users of NativeDataIter
will need to call setGroup() function separately to set group information.
* Save qids_ in SaveBinary()
* Upgrade dmlc-core submodule
* Add a test for reading qid
* Add contributor
* Check the size of qids_
* Document qid format
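For illustration, a hypothetical LIBSVM-style file using qid might look like the sketch below (each line is `<label> qid:<query id> <index>:<value> ...`; loading a plain text file this way assumes a build with the dmlc-core text parsers):
```python
import xgboost as xgb

with open('rank.train.txt', 'w') as f:
    f.write('1 qid:1 1:0.4 3:0.9\n'
            '0 qid:1 2:0.7\n'
            '1 qid:2 1:0.2 4:0.1\n'
            '0 qid:2 3:0.5\n')

# Rows sharing a qid form one query group, so a separate setGroup() call is not needed.
dtrain = xgb.DMatrix('rank.train.txt')
bst = xgb.train({'objective': 'rank:pairwise'}, dtrain, num_boost_round=5)
```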
* allow arbitrary cross validation fold indices
- use training indices passed to `folds` parameter in `training.cv`
- update doc string
* add tests for arbitrary fold indices
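A minimal sketch of the new behaviour, assuming `folds` accepts a list of `(train_indices, test_indices)` pairs (the data here is made up):
```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold

X, y = np.random.rand(100, 5), np.random.randint(2, size=100)
dtrain = xgb.DMatrix(X, label=y)

# Arbitrary, user-defined fold indices:
folds = list(KFold(n_splits=5).split(X))

res = xgb.cv({'objective': 'binary:logistic'}, dtrain,
             num_boost_round=10, folds=folds)
print(res)
```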
* Refactor to allow for custom regularisation methods
* Implement compositional SplitEvaluator framework
* Fixed segfault when no monotone_constraints are supplied.
* Change pid to parentID
* test_monotone_constraints.py now passes
* Refactor ColMaker and DistColMaker to use SplitEvaluator
* Performance optimisation when no monotone_constraints specified
* Fix linter messages
* Fix a few more linter errors
* Update the amalgamation
* Add bounds check
* Add check for leaf node
* Fix linter error in param.h
* Fix clang-tidy errors on CI
* Fix incorrect function name
* Fix clang-tidy error in updater_fast_hist.cc
* Enable SSE2 for Win32 R MinGW
Addresses https://github.com/dmlc/xgboost/pull/3335#issuecomment-400535752
* Add contributor
CI tests were failing because wget prompts the user for a response
whenever the Google Test archive is already on the disk.
Fix: use the `-nc` option to skip the download when the archive already exists.
* add back train method but mark as deprecated
* fix scalastyle error
* maven central release
* add back train method but mark as deprecated
* fix scalastyle error
* [jvm-packages] XGBoost Spark integration refactor. (#3313)
* XGBoost Spark integration refactor.
* Make corresponding update for xgboost4j-example
* Address comments.
* [jvm-packages] Refactor XGBoost-Spark params to make it compatible with both XGBoost and Spark MLLib (#3326)
* Refactor XGBoost-Spark params to make it compatible with both XGBoost and Spark MLLib
* Fix extra space.
* [jvm-packages] XGBoost Spark supports ranking with group data. (#3369)
* XGBoost Spark supports ranking with group data.
* Use Iterator.duplicate to prevent OOM.
* Update CheckpointManagerSuite.scala
* Resolve conflicts
* Use sparse page as singular CSR matrix representation
* Simplify dmatrix methods
* Reduce statefulness of batch iterators
* BREAKING CHANGE: Remove prob_buffer_row parameter. Users are instead recommended to sample their dataset as a preprocessing step before using XGBoost.
* GPU binning and compression.
- binning and index compression are done inside the DeviceShard constructor
- in case of a DMatrix with multiple row batches, it is first converted into a single row batch
Currently, `CLIPredict()` saves prediction results in the default 6-digit precision, which causes precision loss. This PR sets the precision to a level at which the conversion back to `bst_float` is lossless.
Related: #3298.
* add back train method but mark as deprecated
* fix scalastyle error
* update 0.80
Thanks for participating in the XGBoost community! We use https://discuss.xgboost.ai for any general usage questions and discussions. The issue tracker is used for actionable items such as feature proposals, roadmap discussions, and bug tracking. You are always welcome to post on the forum first :)
Issues that are inactive for a period of time may get closed. We adopt this policy so that we won't lose track of actionable issues that may fall to the bottom of the pile. Feel free to open a new issue if you feel there is an additional problem that needs attention when an old one gets closed.
For bug reports, to help the developers act on the issue, please include a description of your environment, preferably a minimal script to reproduce the problem.
For feature proposals, list clear, small actionable items so we can track the progress of the change.
XGBoost has been developed and used by an active community. Everyone is more than welcome to contribute, which is a great way to make the project better and more accessible to more users.
Project Management Committee (PMC)
----------
The Project Management Committee (PMC) consists of a group of active committers that moderate the discussion, manage the project release, and propose new committer/PMC members.
* [Tianqi Chen](https://github.com/tqchen), University of Washington
- Tianqi is a Ph.D. student working on large-scale machine learning. He is the creator of the project.
* [Michael Benesty](https://github.com/pommedeterresautee)
- Michael is a lawyer and data scientist in France. He is the creator of the XGBoost interactive analysis module in R.
* [Yuan Tang](https://github.com/terrytangyuan), Ant Financial
- Yuan is a software engineer in Ant Financial. He contributed mostly in R and Python packages.
* [Nan Zhu](https://github.com/CodingCat), Uber
- Nan is a software engineer in Uber. He contributed mostly in JVM packages.
* [Jiaming Yuan](https://github.com/trivialfis)
- Jiaming contributed to the GPU algorithms. He has also introduced new abstractions to improve the quality of the C++ codebase.
* [Hyunsu Cho](http://hyunsu-cho.io/), NVIDIA
- Hyunsu is the maintainer of the XGBoost Python package. He also manages the Jenkins continuous integration system (https://xgboost-ci.net/). He is the initial author of the CPU 'hist' updater.
* [Rory Mitchell](https://github.com/RAMitchell), University of Waikato
- Rory is a Ph.D. student at University of Waikato. He is the original creator of the GPU training algorithms. He improved the CMake build system and continuous integration.
* [Hongliang Liu](https://github.com/phunterlau)
- Hongliang is the maintainer of the XGBoost Python PyPI package for pip installation.
Committers
----------
Committers are people who have made substantial contribution to the project and granted write access to the project.
* [Tianqi Chen](https://github.com/tqchen), University of Washington
- Tianqi is a Ph.D. student working on large-scale machine learning. He is the creator of the project.
* [Tong He](https://github.com/hetong007), Amazon AI
- Tong is an applied scientist in Amazon AI. He is the maintainer of XGBoost R package.
* [Sergei Lebedev](https://github.com/superbobry), Criteo
- Sergei is a software engineer in Criteo. He contributed mostly in JVM packages.
* [Scott Lundberg](http://scottlundberg.com/), University of Washington
- Scott is a Ph.D. student at University of Washington. He is the creator of SHAP, a unified approach to explain the output of machine learning models such as decision tree ensembles. He also helps maintain the XGBoost Julia package.
Become a Committer
------------------
List of Contributors
--------------------
* [Full List of Contributors](https://github.com/dmlc/xgboost/graphs/contributors)
- To contributors: please add your name to the list when you submit a patch to the project :)
* [Kailong Chen](https://github.com/kalenhaha)
- Kailong is an early contributor of XGBoost. He is the creator of the ranking objectives in XGBoost.
* [Skipper Seabold](https://github.com/jseabold)
- Skipper is the major contributor to the scikit-learn module of XGBoost.
* [Zygmunt Zając](https://github.com/zygmuntz)
- Zygmunt is the master behind the early stopping feature frequently used by kagglers.
* [Ajinkya Kale](https://github.com/ajkl)
* [Boliang Chen](https://github.com/cblsjtu)
* [Yangqing Men](https://github.com/yanqingmen)
- Yangqing is the creator of XGBoost java package.
* [Engpeng Yao](https://github.com/yepyao)
* [Giulio](https://github.com/giuliohome)
- Giulio is the creator of the Windows project of XGBoost.
* [Jamie Hall](https://github.com/nerdcha)
- Jamie is the initial creator of XGBoost scikit-learn module.
* [Yen-Ying Lee](https://github.com/white1033)
* [Masaaki Horikoshi](https://github.com/sinhrks)
- Masaaki is the initial creator of XGBoost Python plotting module.
* [daiyl0320](https://github.com/daiyl0320)
- daiyl0320 contributed a patch to make the XGBoost distributed version more robust and scale stably on TB-scale datasets.
This file records the changes in the XGBoost library in reverse chronological order.
## v1.0.0 (2020.02.19)
This release marks a major milestone for the XGBoost project.
### Apache-style governance, contribution policy, and semantic versioning (#4646, #4659)
* Starting with the 1.0.0 release, the XGBoost Project is adopting Apache-style governance. The full community guideline is [available in the doc website](https://xgboost.readthedocs.io/en/release_1.0.0/contrib/community.html). Note that we now have a Project Management Committee (PMC) that stewards the project on a long-term basis. The PMC is also entrusted to run and fund the project's continuous integration (CI) infrastructure (https://xgboost-ci.net).
* We also adopt the [semantic versioning](https://semver.org/). See [our release versioning policy](https://xgboost.readthedocs.io/en/release_1.0.0/contrib/release.html).
* Poor performance scaling of the `hist` algorithm for multi-core CPUs has been under investigation (#3810). Previous effort #4529 was replaced with a series of pull requests (#5107, #5138, #5156) aimed at achieving the same performance benefits while keeping the C++ codebase legible. The latest performance benchmark results show [up to 5x speedup on Intel CPUs with many cores](https://github.com/dmlc/xgboost/pull/5156#issuecomment-580024413). Note: #5244, which concludes the effort, will become part of the upcoming release 1.1.0.
### Improved installation experience on Mac OSX (#4672, #5074, #5080, #5146, #5240)
* It used to be quite complicated to install XGBoost on Mac OSX. XGBoost uses OpenMP to distribute work among multiple CPU cores, and Mac's default C++ compiler (Apple Clang) does not come with OpenMP. Existing work-around (using another C++ compiler) was complex and prone to fail with cryptic diagnosis (#4933, #4949, #4969).
* Now it only takes two commands to install XGBoost: `brew install libomp` followed by `pip install xgboost`. The installed XGBoost will use all CPU cores.
* Even better, XGBoost is now available from Homebrew: `brew install xgboost`. See Homebrew/homebrew-core#50467.
* Previously, if you installed the XGBoost R package using the command `install.packages('xgboost')`, it could only use a single CPU core and you would experience slow training performance. With the 1.0.0 release, the R package will use all CPU cores out of the box.
### Distributed XGBoost now available on Kubernetes (#4621, #4939)
* Check out the [tutorial for setting up distributed XGBoost on a Kubernetes cluster](https://xgboost.readthedocs.io/en/release_1.0.0/tutorials/kubernetes.html).
### Ruby binding for XGBoost (#4856)
### New Native Dask interface for multi-GPU and multi-node scaling (#4473, #4507, #4617, #4819, #4907, #4914, #4941, #4942, #4951, #4973, #5048, #5077, #5144, #5270)
* XGBoost now integrates seamlessly with [Dask](https://dask.org/), a lightweight distributed framework for data processing. Together with the first-class support for cuDF data frames (see below), it is now easier than ever to create end-to-end data pipeline running on one or more NVIDIA GPUs.
* Multi-GPU training with Dask is now up to 20% faster than the previous release (#4914, #4951).
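As a rough sketch of the Dask interface (the exact entry points may vary between versions; `dask`, `dask-cuda`, and a CUDA-enabled XGBoost build are assumed here):
```python
import dask.array as da
import xgboost as xgb
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

client = Client(LocalCUDACluster())               # one worker per visible GPU
X = da.random.random((100_000, 20), chunks=(1_000, 20))
y = da.random.random(100_000, chunks=1_000)

dtrain = xgb.dask.DaskDMatrix(client, X, y)
output = xgb.dask.train(client, {'tree_method': 'gpu_hist'},
                        dtrain, num_boost_round=100)
booster = output['booster']                       # a regular xgboost.Booster
```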
### First-class support for cuDF data frames and cuPy arrays (#4737, #4745, #4794, #4850, #4891, #4902, #4918, #4927, #4928, #5053, #5189, #5194, #5206, #5219, #5225)
* [cuDF](https://github.com/rapidsai/cudf) is a data frame library for loading and processing tabular data on NVIDIA GPUs. It provides a Pandas-like API.
* [cuPy](https://github.com/cupy/cupy) implements a NumPy-compatible multi-dimensional array on NVIDIA GPUs.
* Now users can keep the data on the GPU memory throughout the end-to-end data pipeline, obviating the need for copying data between the main memory and GPU memory.
* XGBoost can accept any data structure that exposes the `__array_interface__` signature, opening the way to support other columnar formats that are compatible with Apache Arrow.
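A small sketch of passing GPU-resident data directly to XGBoost (assumes cuDF is installed and a CUDA-enabled build; the toy data is made up):
```python
import cudf
import xgboost as xgb

df = cudf.DataFrame({'f0': [0.1, 0.2, 0.3, 0.4],
                     'f1': [1.0, 0.0, 1.0, 0.0]})
label = cudf.Series([0, 1, 0, 1])

# The data frame stays in GPU memory; no round trip through host memory is needed.
dtrain = xgb.DMatrix(df, label=label)
bst = xgb.train({'tree_method': 'gpu_hist', 'objective': 'binary:logistic'},
                dtrain, num_boost_round=10)
```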
### [Feature interaction constraint](https://xgboost.readthedocs.io/en/release_1.0.0/tutorials/feature_interaction_constraint.html) is now available with `approx` and `gpu_hist` algorithms (#4534, #4587, #4596, #5034).
### Learning to rank is now GPU accelerated (#4873, #5004, #5129)
* [Up to 2x improved training performance on GPUs](https://devblogs.nvidia.com/learning-to-rank-with-xgboost-and-gpu/).
### Enable `gamma` parameter for GPU training (#4874, #4953)
* The `gamma` parameter specifies the minimum loss reduction required to add a new split in a tree. A larger value for `gamma` has the effect of pre-pruning the tree, by making it harder to add splits.
### External memory for GPU training (#4486, #4526, #4747, #4833, #4879, #5014)
* It is now possible to use NVIDIA GPUs even when the size of training data exceeds the available GPU memory. Note that the external memory support for GPU is still experimental. #5093 will further improve performance and will become part of the upcoming release 1.1.0.
* RFC for enabling external memory with GPU algorithms: #4357
* Many users of XGBoost enjoy the convenience and breadth of the Scikit-Learn ecosystem. In this release, we revise the Scikit-Learn API of XGBoost (`XGBRegressor`, `XGBClassifier`, and `XGBRanker`) to achieve feature parity with the traditional XGBoost interface (`xgboost.train()`).
* Insert check to validate data shapes.
* Produce an error message if `eval_set` is not a tuple. An error message is better than silently crashing.
* Clean up checkpoint file after a successful training job (#4754): The current implementation in XGBoost4J-Spark does not clean up the checkpoint file after a successful training job. If the user runs another job with the same checkpointing directory, she will get a wrong model because the second job will re-use the checkpoint file left over from the first job. To prevent this scenario, we propose to always clean up the checkpoint file after every successful training job.
* Avoid Multiple Jobs for Checkpointing (#5082): The current method for checkpointing is to collect the booster produced at the last iteration of each checkpoint interval to the Driver and persist it in HDFS. The major issue with this approach is that it needs to re-perform the data preparation for training if the user did not choose to cache the training dataset. To avoid re-performing data prep, we build external-memory checkpointing in the XGBoost4J layer as well.
* Enable deterministic repartitioning when checkpoint is enabled (#4807): The distributed algorithm for gradient boosting assumes a fixed partition of the training data between multiple iterations. In previous versions, there was no guarantee that the data partition would stay the same, especially when a worker goes down and some data had to be recovered from a previous checkpoint. In this release, we make the data partition deterministic by using the hash value of each data row when computing the partition.
### XGBoost4J-Spark: handle errors thrown by the native code (#4560)
* All core logic of XGBoost is written in C++, so XGBoost4J-Spark internally uses the C++ code via Java Native Interface (JNI). #4560 adds a proper error handling for any errors or exceptions arising from the C++ code, so that the XGBoost Spark application can be torn down in an orderly fashion.
### XGBoost4J-Spark: Refine method to count the number of alive cores (#4858)
* The `SparkParallelismTracker` class ensures that sufficient number of executor cores are alive. To that end, it is important to query the number of alive cores reliably.
### XGBoost4J: Add `BigDenseMatrix` to store more than `Integer.MAX_VALUE` elements (#4383)
* In this release, we introduce experimental support for using [JSON](https://www.json.org/json-en.html) for serializing (saving/loading) XGBoost models and related hyperparameters for training. We would like to eventually replace the old binary format with JSON, since it is an open format and parsers are available in many programming languages and platforms. See [the documentation for model I/O using JSON](https://xgboost.readthedocs.io/en/release_1.0.0/tutorials/saving_model.html). #3980 explains why JSON was chosen over other alternatives.
* To maximize interoperability and compatibility of the serialized models, we now split serialization into two parts (#4855):
1. Model, e.g. decision trees and strictly related metadata like `num_features`.
2. Internal configuration, consisting of training parameters and other configurable parameters. For example, `max_delta_step`, `tree_method`, `objective`, `predictor`, `gpu_id`.
Previously, users often ran into issues where the model file produced by one machine could not load or run on another machine. For example, models trained on a machine with an NVIDIA GPU could not run on another machine without a GPU (#5291, #5234). The reason is that the old binary format saved some internal configuration that was not universally applicable to all machines, e.g. `predictor='gpu_predictor'`.
Now, the model saving function (`Booster.save_model()` in Python) will save only the model, without internal configuration. This guarantees that your model file can be used anywhere. Internal configuration will be serialized in limited circumstances such as:
* Multiple nodes in a distributed system exchange model details over the network.
* Model checkpointing, to recover from possible crashes.
This work proved to be useful for parameter validation as well (see below).
* Starting with 1.0.0 release, we will use semantic versioning to indicate whether the model produced by one version of XGBoost would be compatible with another version of XGBoost. Any change in the major version indicates a breaking change in the serialization format.
* We now provide a robust method to save and load scikit-learn related attributes (#5245). Previously, we used Python pickle to save Python attributes related to `XGBClassifier`, `XGBRegressor`, and `XGBRanker` objects. The attributes are necessary to properly interact with scikit-learn. See #4639 for more details. The use of pickling hampered interoperability, as a pickle from one machine may not necessarily work on another machine. Starting with this release, we use an alternative method to serialize the scikit-learn related attributes. The use of Python pickle is now discouraged (#5236, #5281).
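A minimal sketch of the experimental JSON serialization (toy data for illustration; the `.json` file extension is what selects the JSON format here):
```python
import numpy as np
import xgboost as xgb

X, y = np.random.rand(100, 5), np.random.randint(2, size=100)
bst = xgb.train({'objective': 'binary:logistic'},
                xgb.DMatrix(X, label=y), num_boost_round=10)

# Saves only the model (trees and strictly related metadata), not internal configuration.
bst.save_model('model.json')

bst2 = xgb.Booster()
bst2.load_model('model.json')
```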
### Parameter validation: detection of unused or incorrect parameters (#4553, #4577, #4738, #4801, #4961, #5101, #5157, #5167, #5256)
* A mis-spelled training parameter is a common user mistake. In previous versions of XGBoost, mis-spelled parameters were silently ignored. Starting with the 1.0.0 release, XGBoost will produce a warning message if there are any unused training parameters. Currently, parameter validation is available to R users and Python XGBoost API users. We are working to extend its support to scikit-learn users.
* Configuration steps now have well-defined semantics (#4542, #4738), so we know exactly where and how the internal configurable parameters are changed.
* The user can now use the `save_config()` function to inspect all (used) training parameters. This is helpful for debugging model performance.
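A short sketch of inspecting the used parameters (the toy model is made up):
```python
import numpy as np
import xgboost as xgb

X, y = np.random.rand(50, 4), np.random.rand(50)
bst = xgb.train({'max_depth': 3}, xgb.DMatrix(X, label=y), num_boost_round=5)

# save_config() returns the full internal configuration as a JSON string,
# which makes it easy to check whether a parameter was actually picked up.
print(bst.save_config()[:200])
```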
### Allow individual workers to recover from faults (#4808, #4966)
* Status quo: if a worker fails, all workers are shut down and restarted, and learning resumes from the last checkpoint. This involves requesting resources from the scheduler (e.g. Spark) and shuffling all the data again from scratch. Both of these operations can be quite costly and block training for extended periods of time, especially if the training data is big and the number of worker nodes is in the hundreds.
* The proposed solution is to recover the single node that failed, instead of shutting down all workers. The rest of the cluster waits until the failed worker is bootstrapped and catches up with the rest.
* See roadmap at #4753. Note that this is work in progress. In particular, the feature is not yet available from XGBoost4J-Spark.
### Accurate prediction for DART models
* Use DART tree weights when computing SHAPs (#5050)
* Don't drop trees during DART prediction by default (#5115)
* Fix DART prediction in R (#5204)
### Make external memory more robust
* Fix issues with training with external memory on cpu (#4487)
* Fix crash with approx tree method on cpu (#4510)
* Fix external memory race in `exact` (#4980). Note: `dmlc::ThreadedIter` is not actually thread-safe. We would like to re-design it in the long term.
### Major refactoring of the `DMatrix` class (#4686, #4744, #4748, #5044, #5092, #5108, #5188, #5198)
* Goal 1: improve performance and reduce memory consumption. Right now, if the user trains a model with a NumPy array as training data, the array gets copied 2-3 times before training begins. We'd like to reduce duplication of the data matrix.
* Goal 2: Expose a common interface to external data, unify the way DMatrix objects are constructed and simplify the process of adding new external data sources. This work is essential for ingesting cuPy arrays.
* Goal 3: Handle missing values consistently.
* RFC: #4354, Roadmap: #5143
* This work is also relevant to external memory support on GPUs.
### Breaking: XGBoost Python package now requires Python 3.5 or newer (#5021, #5274)
* Python 3.4 has reached its end-of-life on March 16, 2019, so we now require Python 3.5 or newer.
### Breaking: GPU algorithm now requires CUDA 9.0 and higher (#4527, #4580)
### Breaking: `n_gpus` parameter removed; multi-GPU training now requires a distributed framework (#4579, #4749, #4773, #4810, #4867, #4908)
* #4531 proposed removing support for single-process multi-GPU training. Contributors would focus on multi-GPU support through distributed frameworks such as Dask and Spark, where the framework would be expected to assign a worker process for each GPU independently. By delegating GPU management and data movement to the distributed framework, we can greatly simplify the core XGBoost codebase, make multi-GPU training more robust, and reduce burden for future development.
### Breaking: Some deprecated features have been removed
* ``gpu_exact`` training method (#4527, #4742, #4777). Use ``gpu_hist`` instead.
* ``learning_rates`` parameter in Python (#5155). Use the callback API instead.
* ``num_roots`` (#5059, #5165), since the current training code always uses a single root node.
* GPU-specific objectives (#4690), such as `gpu:reg:linear`. Use objectives without `gpu:` prefix; GPU will be used automatically if your machine has one.
### Breaking: the C API function `XGBoosterPredict()` now asks for an extra parameter `training`.
### Breaking: We now use CMake exclusively to build XGBoost. `Makefile` is being sunset.
* Exception: the R package uses Autotools, as the CRAN ecosystem did not yet adopt CMake widely.
### Performance improvements
* Smarter choice of histogram construction for distributed `gpu_hist` (#4519)
* Optimizations for quantization on device (#4572)
* Introduce caching memory allocator to avoid latency associated with GPU memory allocation (#4554, #4615)
* Optimize the initialization stage of the CPU `hist` algorithm for sparse datasets (#4625)
* Prevent unnecessary data copies from GPU memory to the host (#4795)
* Improve operation efficiency for single prediction (#5016)
* Group builder modified for incremental building, to speed up building large `DMatrix` (#5098)
### Bug-fixes
* Eliminate `FutureWarning: Series.base is deprecated` (#4337)
* Ensure pandas DataFrame column names are treated as strings in type error message (#4481)
* [jvm-packages] Add back `reg:linear` for scala, as it is only deprecated and not meant to be removed yet (#4490)
* Fix library loading for Cygwin users (#4499)
* Fix prediction from loaded pickle (#4516)
* Enforce exclusion between `pred_interactions=True` and `pred_contribs=True` (#4522)
* Do not return dangling reference to local `std::string` (#4543)
* Set the appropriate device before freeing device memory (#4566)
* Mark `SparsePageDmatrix` destructor default. (#4568)
* Choose the appropriate tree method only when the tree method is 'auto' (#4571)
* Fix `benchmark_tree.py` (#4593)
* [jvm-packages] Fix silly bug in feature scoring (#4604)
* Fix GPU predictor when the test data matrix has different number of features than the training data matrix used to train the model (#4613)
* Fix external memory for get column batches. (#4622)
* [R] Use built-in label when xgb.DMatrix is given to xgb.cv() (#4631)
* Fix early stopping in the Python package (#4638)
* Fix AUC error in distributed mode caused by imbalanced dataset (#4645, #4798)
* [jvm-packages] Expose `setMissing` method in `XGBoostClassificationModel` / `XGBoostRegressionModel` (#4643)
* Remove initializing stringstream reference. (#4788)
* [R] `xgb.get.handle` now checks all class listed of `object` (#4800)
* Do not use `gpu_predictor` unless data comes from GPU (#4836)
* Fix data loading (#4862)
* Workaround `isnan` across different environments. (#4883)
* Don't `set_params` at the end of `set_state` (#4947). Ensure that the model does not change after pickling and unpickling multiple times.
* C++ exceptions should not crash OpenMP loops (#4960)
* Fix `usegpu` flag in DART. (#4984)
* Run training with empty `DMatrix` (#4990, #5159)
* Ensure that no two processes can use the same GPU (#4990)
* Fix repeated split and 0 cover nodes (#5010)
* Reset histogram hit counter between multiple data batches (#5035)
* Fix `feature_name` created from Int64Index dataframe. (#5081)
* Don't use 0 for "fresh leaf" (#5084)
* Throw error when user attempts to use multi-GPU training and XGBoost has not been compiled with NCCL (#5170)
* Fix metric name loading (#5122)
* Quick fix for memory leak in CPU `hist` algorithm (#5153)
* Fix wrapping GPU ID and prevent data copying (#5160)
* Fix signature of Span constructor (#5166)
* Lazy initialization of device vector, so that XGBoost compiled with CUDA can run on a machine without any GPU (#5173)
* Model loading should not change system locale (#5314)
* Distributed training jobs would sometimes hang; revert Rabit to fix this regression (dmlc/rabit#132, #5237)
### API changes
* Add support for cross-validation using query ID (#4474)
* Enable feature importance property for DART model (#4525)
* Add `rmsle` metric and `reg:squaredlogerror` objective (#4541)
* All objective and evaluation metrics are now exposed to JVM packages (#4560)
* `dump_model()` and `get_dump()` now support exporting in GraphViz language (#4602)
* Support metrics `ndcg-` and `map-` (#4635)
* [jvm-packages] Allow chaining prediction (transform) in XGBoost4J-Spark (#4667)
* [jvm-packages] Add option to bypass missing value check in the Spark layer (#4805). Only use this option if you know what you are doing.
* [jvm-packages] Add public group getter (#4838)
* `XGDMatrixSetGroup` C API is now deprecated (#4864). Use `XGDMatrixSetUIntInfo` instead.
* [R] Added new `train_folds` parameter to `xgb.cv()` (#5114)
* Ingest meta information from Pandas DataFrame, such as data weights (#5216)
### Maintenance: Refactor code for legibility and maintainability
* De-duplicate GPU parameters (#4454)
* Simplify INI-style config reader using C++11 STL (#4478, #4521)
* Refactor histogram building code for `gpu_hist` (#4528)
* Overload device memory allocator, to enable instrumentation for compiling memory usage statistics (#4532)
* Refactor out row partitioning logic from `gpu_hist` (#4554)
* Remove an unused variable (#4588)
* Implement tree model dump with code generator, to de-duplicate code for generating dumps in 3 different formats (#4602)
* Remove `RowSet` class which is no longer being used (#4697)
* Remove some unused functions as reported by cppcheck (#4743)
* Mimic CUDA assert output in Span check (#4762)
* [jvm-packages] Refactor `XGBoost.scala` to put all params processing in one place (#4815)
* Add some comments for GPU row partitioner (#4832)
* Span: use `size_t` for index_type, add `front` and `back`. (#4935)
* Remove dead code in `exact` algorithm (#5034, #5105)
* Unify integer types used for row and column indices (#5034)
* Extract feature interaction constraint from `SplitEvaluator` class. (#5034)
* [Breaking] De-duplicate parameters and docstrings in the constructors of Scikit-Learn models (#5130)
* Remove benchmark code from GPU tests (#5141)
* Clean up Python 2 compatibility code. (#5161)
* Extensible binary serialization format for `DMatrix::MetaInfo` (#5187). This will be useful for implementing censored labels for survival analysis applications.
* Cleanup clang-tidy warnings. (#5247)
### Maintenance: testing, continuous integration, build system
* Use `yaml.safe_load` instead of `yaml.load`. (#4537)
* Ensure GCC is at least 5.x (#4538)
* Remove all mention of `reg:linear` from tests (#4544)
* [jvm-packages] Upgrade to Scala 2.12 (#4574)
* [jvm-packages] Update kryo dependency to 2.22 (#4575)
* [CI] Specify account ID when logging into ECR Docker registry (#4584)
* Use Sphinx 2.1+ to compile documentation (#4609)
* Make Pandas optional for running Python unit tests (#4620)
* Fix spark tests on machines with many cores (#4634)
* [jvm-packages] Update local dev build process (#4640)
* Add optional dependencies to setup.py (#4655)
* [jvm-packages] Fix maven warnings (#4664)
* Remove extraneous files from the R package, to comply with CRAN policy (#4699)
* Remove VC-2013 support, since it is not C++11 compliant (#4701)
* [CI] Fix broken installation of Pandas (#4704, #4722)
* [jvm-packages] Clean up temporary files after running tests (#4706)
* Specify version macro in CMake. (#4730)
* Include dmlc-tracker into XGBoost Python package (#4731)
* [CI] Use long key ID for Ubuntu repository fingerprints. (#4783)
* Remove plugin, cuda related code in automake & autoconf files (#4789)
* Skip related tests when scikit-learn is not installed. (#4791)
* Ignore vscode and clion files (#4866)
* Use bundled Google Test by default (#4900)
* [CI] Raise timeout threshold in Jenkins (#4938)
* Copy CMake parameter from dmlc-core. (#4948)
* Set correct file permission. (#4964)
* [CI] Update lint configuration to support latest pylint convention (#4971)
* [CI] Upload nightly builds to S3 (#4976, #4979)
* Add asan.so.5 to cmake script. (#4999)
* [CI] Fix Travis tests. (#5062)
* [CI] Locate vcomp140.dll from System32 directory (#5078)
* Implement training observer to dump internal states of objects (#5088). This will be useful for debugging.
* Fix visual studio output library directories (#5119)
* [jvm-packages] Comply with scala style convention + fix broken unit test (#5134)
* [CI] Repair download URL for Maven 3.6.1 (#5139)
* Don't use modernize-use-trailing-return-type in clang-tidy. (#5169)
* Explicitly use UTF-8 codepage when using MSVC (#5197)
* Add CMake option to run Undefined Behavior Sanitizer (UBSan) (#5211)
* Make some GPU tests deterministic (#5229)
* [R] Robust endian detection in CRAN xgboost build (#5232)
* Support FreeBSD (#5233)
* Make `pip install xgboost*.tar.gz` work by fixing build-python.sh (#5241)
* Fix compilation error due to 64-bit integer narrowing to `size_t` (#5250)
* Remove use of `std::cout` from R package, to comply with CRAN policy (#5261)
## v0.90
Python 2.x is reaching its end-of-life at the end of this year. [Many scientific Python packages are now moving to drop Python 2.x](https://python3statement.org/).
### XGBoost4J-Spark now requires Spark 2.4.x (#4377)
* Spark 2.3 is reaching its end-of-life soon. See discussion at #4389.
* **Consistent handling of missing values** (#4309, #4349, #4411): Many users had reported issue with inconsistent predictions between XGBoost4J-Spark and the Python XGBoost package. The issue was caused by Spark mis-handling non-zero missing values (NaN, -1, 999 etc). We now alert the user whenever Spark doesn't handle missing values correctly (#4309, #4349). See [the tutorial for dealing with missing values in XGBoost4J-Spark](https://xgboost.readthedocs.io/en/release_0.90/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values). This fix also depends on the availability of Spark 2.4.x.
### Roadmap: better performance scaling for multi-core CPUs (#4310)
* Poor performance scaling of the `hist` algorithm for multi-core CPUs has been under investigation (#3810). #4310 optimizes quantile sketches and other pre-processing tasks. Special thanks to @SmirnovEgorRu.
### Roadmap: Harden distributed training (#4250)
* Make distributed training in XGBoost more robust by hardening [Rabit](https://github.com/dmlc/rabit), which implements [the AllReduce primitive](https://en.wikipedia.org/wiki/Reduce_%28parallel_pattern%29). In particular, improve test coverage on mechanisms for fault tolerance and recovery. Special thanks to @chenqin.
### New feature: Multi-class metric functions for GPUs (#4368)
* Metrics for multi-class classification have been ported to GPU: `merror`, `mlogloss`. Special thanks to @trivialfis.
* With supported metrics, XGBoost will select the correct devices based on your system and `n_gpus` parameter.
### New feature: Scikit-learn-like random forest API (#4148, #4255, #4258)
* XGBoost Python package now offers `XGBRFClassifier` and `XGBRFRegressor` API to train random forests. See [the tutorial](https://xgboost.readthedocs.io/en/release_0.90/tutorials/rf.html). Special thanks to @canonizer
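A minimal usage sketch (synthetic data for illustration):
```python
from sklearn.datasets import make_classification
from xgboost import XGBRFClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# A single forest of 100 trees grown in one round, rather than 100 boosting rounds.
rf = XGBRFClassifier(n_estimators=100, max_depth=4, random_state=0)
rf.fit(X, y)
print(rf.predict_proba(X[:3]))
```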
### New feature: use external memory in GPU predictor (#4284, #4396, #4438, #4457)
* It is now possible to make predictions on GPU when the input is read from external memory. This is useful when you want to make predictions with big dataset that does not fit into the GPU memory. Special thanks to @rongou, @canonizer, @sriramch.
### Maintenance: testing, continuous integration, build system
* **Major refactor of CMakeLists.txt** (#4323, #4333, #4453): adopt modern CMake and export XGBoost as a target
* **Major improvement in Jenkins CI pipeline** (#4234)
- Migrate all Linux tests to Jenkins (#4401)
- Builds and tests are now de-coupled, to test an artifact against multiple versions of CUDA, JDK, and other dependencies (#4401)
- Add Windows GPU to Jenkins CI pipeline (#4463, #4469)
* Support CUDA 10.1 (#4223, #4232, #4265, #4468)
* Python wheels are now built with CUDA 9.0, so that JIT is not required on Volta architecture (#4459)
* Integrate with NVTX CUDA profiler (#4205)
* Add a test for cpu predictor using external memory (#4308)
* Refactor tests to get rid of duplication (#4358)
* Remove test dependency on `craigcitro/r-travis`, since it's deprecated (#4353)
* Add files from local R build to `.gitignore` (#4346)
* Make XGBoost4J compatible with Java 9+ by revising NativeLibLoader (#4351)
* Jenkins build for CUDA 10.0 (#4281)
* Remove remaining `silent` and `debug_verbose` in Python tests (#4299)
* Use all cores to build XGBoost4J lib on linux (#4304)
* Upgrade Jenkins Linux build environment to GCC 5.3.1, CMake 3.6.0 (#4306)
* Make CMakeLists.txt compatible with CMake 3.3 (#4420)
* Add OpenMP option in CMakeLists.txt (#4339)
* Get rid of a few trivial compiler warnings (#4312)
* Add external Docker build cache, to speed up builds on Jenkins CI (#4331, #4334, #4458)
* Fix Windows tests (#4403)
* Fix a broken python test (#4395)
* Use a fixed seed to split data in XGBoost4J-Spark tests, for reproducibility (#4417)
* Add additional Python tests to test training under constraints (#4426)
* Enable building with shared NCCL. (#4447)
### Usability Improvements, Documentation
* Document limitation of one-split-at-a-time Greedy tree learning heuristic (#4233)
* Update build doc: PyPI wheel now support multi-GPU (#4219)
* Fix docs for `num_parallel_tree` (#4221)
* Fix document about `colsample_by*` parameter (#4340)
* Make the train and test input with same colnames. (#4329)
* Update R contribute link. (#4236)
* Fix travis R tests (#4277)
* Log version number in crash log in XGBoost4J-Spark (#4271, #4303)
* Allow suppression of Rabit output in Booster::train in XGBoost4J (#4262)
* Add tutorial on handling missing values in XGBoost4J-Spark (#4425)
* Fix typos (#4345, #4393, #4432, #4435)
* Added language classifier in setup.py (#4327)
* Added Travis CI badge (#4344)
* Add BentoML to use case section (#4400)
* Remove subtly sexist remark (#4418)
* Add R vignette about parsing JSON dumps (#4439)
### Acknowledgement
**Contributors**: Nan Zhu (@CodingCat), Adam Pocock (@Craigacp), Daniel Hen (@Daniel8hen), Jiaxiang Li (@JiaxiangBU), Rory Mitchell (@RAMitchell), Egor Smirnov (@SmirnovEgorRu), Andy Adinets (@canonizer), Jonas (@elcombato), Harry Braviner (@harrybraviner), Philip Hyunsu Cho (@hcho3), Tong He (@hetong007), James Lamb (@jameslamb), Jean-Francois Zinque (@jeffzi), Yang Yang (@jokerkeny), Mayank Suman (@mayanksuman), jess (@monkeywithacupcake), Hajime Morrita (@omo), Ravi Kalia (@project-delphi), @ras44, Rong Ou (@rongou), Shaochen Shi (@shishaochen), Xu Xiao (@sperlingxx), @sriramch, Jiaming Yuan (@trivialfis), Christopher Suchanek (@wsuchy), Bozhao (@yubozhao)
**Reviewers**: Nan Zhu (@CodingCat), Adam Pocock (@Craigacp), Daniel Hen (@Daniel8hen), Jiaxiang Li (@JiaxiangBU), Laurae (@Laurae2), Rory Mitchell (@RAMitchell), Egor Smirnov (@SmirnovEgorRu), @alois-bissuel, Andy Adinets (@canonizer), Chen Qin (@chenqin), Harry Braviner (@harrybraviner), Philip Hyunsu Cho (@hcho3), Tong He (@hetong007), @jakirkham, James Lamb (@jameslamb), Julien Schueller (@jschueller), Mayank Suman (@mayanksuman), Hajime Morrita (@omo), Rong Ou (@rongou), Sara Robinson (@sararob), Shaochen Shi (@shishaochen), Xu Xiao (@sperlingxx), @sriramch, Sean Owen (@srowen), Sergei Lebedev (@superbobry), Yuan (Terry) Tang (@terrytangyuan), Theodore Vasiloudis (@thvasilo), Matthew Tovbin (@tovbinm), Jiaming Yuan (@trivialfis), Xin Yin (@xydrolase)
## v0.82 (2019.03.03)
This release is packed with many new features and bug fixes.
### Roadmap: better performance scaling for multi-core CPUs (#3957)
* Poor performance scaling of the `hist` algorithm for multi-core CPUs has been under investigation (#3810). #3957 marks an important step toward better performance scaling, by using software pre-fetching and replacing STL vectors with C-style arrays. Special thanks to @Laurae2 and @SmirnovEgorRu.
* See #3810 for latest progress on this roadmap.
### New feature: Distributed Fast Histogram Algorithm (`hist`) (#4011, #4102, #4140, #4128)
* It is now possible to run the `hist` algorithm in distributed setting. Special thanks to @CodingCat. The benefits include:
1. Faster local computation via feature binning
2. Support for monotonic constraints and feature interaction constraints
3. Simpler codebase than `approx`, allowing for future improvement
* Depth-wise tree growing is now performed in a separate code path, so that cross-node synchronization is performed only once per level.
### New feature: Multi-Node, Multi-GPU training (#4095)
* Distributed training is now able to utilize clusters equipped with NVIDIA GPUs. In particular, the rabit AllReduce layer will communicate GPU device information. Special thanks to @mt-jones, @RAMitchell, @rongou, @trivialfis, @canonizer, and @jeffdk.
* Resource management systems will be able to assign a rank for each GPU in the cluster.
* In Dask, users will be able to construct a collection of XGBoost processes over an inhomogeneous device cluster (i.e. workers with different number and/or kinds of GPUs).
### New feature: Multiple validation datasets in XGBoost4J-Spark (#3904, #3910)
* You can now track the performance of the model during training with multiple evaluation datasets. By specifying `eval_sets` or calling `setEvalSets` on an `XGBoostClassifier` or `XGBoostRegressor`, you can pass in multiple evaluation datasets typed as a `Map` from `String` to `DataFrame`. Special thanks to @CodingCat.
* See the usage of multiple validation datasets [here](https://github.com/dmlc/xgboost/blob/0c1d5f1120c0a159f2567b267f0ec4ffadee00d0/jvm-packages/xgboost4j-example/src/main/scala/ml/dmlc/xgboost4j/scala/example/spark/SparkTraining.scala#L66-L78)
### New feature: Additional metric functions for GPUs (#3952)
* Element-wise metrics have been ported to GPU: `rmse`, `mae`, `logloss`, `poisson-nloglik`, `gamma-deviance`, `gamma-nloglik`, `error`, `tweedie-nloglik`. Special thanks to @trivialfis and @RAMitchell.
* With supported metrics, XGBoost will select the correct devices based on your system and `n_gpus` parameter.
### New feature: Column sampling at individual nodes (splits) (#3971)
* Columns (features) can now be sampled at individual tree nodes, in addition to per-tree and per-level sampling. To enable per-node sampling, set `colsample_bynode` parameter, which represents the fraction of columns sampled at each node. This parameter is set to 1.0 by default (i.e. no sampling per node). Special thanks to @canonizer.
* The `colsample_bynode` parameter works cumulatively with other `colsample_by*` parameters: for example, `{'colsample_bynode':0.5, 'colsample_bytree':0.5}` with 100 columns will give 25 features to choose from at each split.
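A small sketch of how the sampling parameters combine (toy data, made-up values):
```python
import numpy as np
import xgboost as xgb

X, y = np.random.rand(500, 100), np.random.randint(2, size=500)
dtrain = xgb.DMatrix(X, label=y)

params = {
    'objective': 'binary:logistic',
    'colsample_bytree': 0.5,   # 50 of the 100 columns are kept per tree
    'colsample_bynode': 0.5,   # ~25 of those are then available at each split
}
bst = xgb.train(params, dtrain, num_boost_round=20)
```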
### Major API change: consistent logging level via `verbosity` (#3982, #4002, #4138)
* XGBoost now allows fine-grained control over logging. You can set `verbosity` to 0 (silent), 1 (warning), 2 (info), and 3 (debug). This is useful for controlling the amount of logging outputs. Special thanks to @trivialfis.
* Parameters `silent` and `debug_verbose` are now deprecated.
* Note: Sometimes XGBoost tries to change configurations based on heuristics, which is displayed as a warning message. If there's unexpected behaviour, please try increasing the value of `verbosity`.
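A minimal sketch (toy data for illustration):
```python
import numpy as np
import xgboost as xgb

X, y = np.random.rand(100, 5), np.random.randint(2, size=100)
dtrain = xgb.DMatrix(X, label=y)

# 0 = silent, 1 = warning, 2 = info, 3 = debug
bst = xgb.train({'objective': 'binary:logistic', 'verbosity': 2},
                dtrain, num_boost_round=5)
```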
### Major bug fix: external memory (#4040, #4193)
* Clarify object ownership in multi-threaded prefetcher, to avoid memory error.
* Correctly merge two column batches (which uses [CSC layout](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_column_(CSC_or_CCS))).
* Add unit tests for external memory.
* Special thanks to @trivialfis and @hcho3.
### Major bug fix: early stopping fixed in XGBoost4J and XGBoost4J-Spark (#3928, #4176)
* Early stopping in XGBoost4J and XGBoost4J-Spark is now consistent with its counterpart in the Python package. Training stops if the current iteration is `earlyStoppingSteps` away from the best iteration. If there are multiple evaluation sets, only the last one is used to determine early stopping.
* See the updated documentation [here](https://xgboost.readthedocs.io/en/release_0.82/jvm/xgboost4j_spark_tutorial.html#early-stopping)
* Special thanks to @CodingCat, @yanboliang, and @mingyang.
### Major bug fix: infrequent features should not crash distributed training (#4045)
* For infrequently occurring features, some partitions may not get any instance. This scenario used to crash distributed training due to mal-formed ranges. The problem has now been fixed.
* In practice, one-hot-encoded categorical variables tend to produce rare features, particularly when the cardinality is high.
* Special thanks to @CodingCat.
### Performance improvements
* Faster, more space-efficient radix sorting in `gpu_hist` (#3895)
* Subtraction trick in histogram calculation in `gpu_hist` (#3945)
* More performant re-partition in XGBoost4J-Spark (#4049)
### Bug-fixes
* Fix semantics of `gpu_id` when running multiple XGBoost processes on a multi-GPU machine (#3851)
* Fix page storage path for external memory on Windows (#3869)
* Fix configuration setup so that DART utilizes GPU (#4024)
* Eliminate NAN values from SHAP prediction (#3943)
* Prevent empty quantile sketches in `hist` (#4155)
* Enable running objectives with 0 GPU (#3878)
* Parameters are no longer dependent on system locale (#3891, #3907)
* Use consistent data type in the GPU coordinate descent code (#3917)
* Remove undefined behavior in the CLI config parser on the ARM platform (#3976)
* Initialize counters in GPU AllReduce (#3987)
* Prevent deadlocks in GPU AllReduce (#4113)
* Load correct values from sliced NumPy arrays (#4147, #4165)
* Fix incorrect GPU device selection (#4161)
* Make feature binning logic in `hist` aware of query groups when running a ranking task (#4115). For ranking task, query groups are weighted, not individual instances.
* Generate correct C++ exception type for `LOG(FATAL)` macro (#4159)
* Python package
- Python package should run on system without `PATH` environment variable (#3845)
- Fix `coef_` and `intercept_` signature to be compatible with `sklearn.RFECV` (#3873)
- Use UTF-8 encoding in Python package README, to support non-English locale (#3867)
- Add AUC-PR to list of metrics to maximize for early stopping (#3936)
- Allow loading pickles without `self.booster` attribute, for backward compatibility (#3938, #3944)
- White-list DART for feature importances (#4073)
- Update usage of [h2oai/datatable](https://github.com/h2oai/datatable) (#4123)
* XGBoost4J-Spark
- Address scalability issue in prediction (#4033)
- Enforce the use of per-group weights for ranking task (#4118)
- Fix vector size of `rawPredictionCol` in `XGBoostClassificationModel` (#3932)
- More robust error handling in Spark tracker (#4046, #4108)
- Fix return type of `setEvalSets` (#4105)
- Return correct value of `getMaxLeaves` (#4114)
### API changes
* Add experimental parameter `single_precision_histogram` to use single-precision histograms for the `gpu_hist` algorithm (#3965)
* Python package
- Add option to select type of feature importances in the scikit-learn interface (#3876)
- Add `trees_to_df()` method to dump decision trees as Pandas data frame (#4153)
- Add options to control node shapes in the GraphViz plotting function (#3859)
- Add `xgb_model` option to `XGBClassifier`, to load previously saved model (#4092)
- Passing lists into `DMatrix` is now deprecated (#3970)
* XGBoost4J
- Support multiple feature importance features (#3801)
### Maintenance: Refactor C++ code for legibility and maintainability
* Refactor `hist` algorithm code and add unit tests (#3836)
* Minor refactoring of split evaluator in `gpu_hist` (#3889)
* Removed unused leaf vector field in the tree model (#3989)
* Simplify the tree representation by combining `TreeModel` and `RegTree` classes (#3995)
* Simplify and harden tree expansion code (#4008, #4015)
* De-duplicate parameter classes in the linear model algorithms (#4013)
* Robust handling of ranges with C++20 span in `gpu_exact` and `gpu_coord_descent` (#4020, #4029)
* Simplify tree training code (#3825). Also use Span class for robust handling of ranges.
### Maintenance: testing, continuous integration, build system
* Disallow `std::regex` since it's not supported by GCC 4.8.x (#3870)
* Add multi-GPU tests for coordinate descent algorithm for linear models (#3893, #3974)
* Enforce naming style in Python lint (#3896)
* Refactor Python tests (#3897, #3901): Use pytest exclusively, display full trace upon failure
* Address `DeprecationWarning` when using Python collections (#3909)
* Use correct group for maven site plugin (#3937)
* Jenkins CI is now using on-demand EC2 instances exclusively, due to unreliability of Spot instances (#3948)
* Better GPU performance logging (#3945)
* Fix GPU tests on machines with only 1 GPU (#4053)
* Eliminate CRAN check warnings and notes (#3988)
* Add unit tests for tree serialization (#3989)
* Add unit tests for tree fitting functions in `hist` (#4155)
* Add a unit test for `gpu_exact` algorithm (#4020)
* Correct JVM CMake GPU flag (#4071)
* Fix failing Travis CI on Mac (#4086)
* Speed up Jenkins by not compiling CMake (#4099)
* Analyze C++ and CUDA code using clang-tidy, as part of Jenkins CI pipeline (#4034)
* Fix broken R test: Install Homebrew GCC (#4142)
* Check for empty datasets in GPU unit tests (#4151)
* Fix Windows compilation (#4139)
* Comply with latest convention of cpplint (#4157)
* Fix a unit test in `gpu_hist` (#4158)
* Speed up data generation in Python tests (#4164)
### Usability Improvements
* Add link to [InfoWorld 2019 Technology of the Year Award](https://www.infoworld.com/article/3336072/application-development/infoworlds-2019-technology-of-the-year-award-winners.html) (#4116)
* Remove outdated AWS YARN tutorial (#3885)
* Document current limitation in number of features (#3886)
* Remove unnecessary warning when `gblinear` is selected (#3888)
* Document limitation of CSV parser: header not supported (#3934)
* Log training parameters in XGBoost4J-Spark (#4091)
* Clarify early stopping behavior in the scikit-learn interface (#3967)
* Clarify behavior of `max_depth` parameter (#4078)
* Revise Python docstrings for ranking task (#4121). In particular, weights must be per-group in learning-to-rank setting.
* Document parameter `num_parallel_tree` (#4022)
* Add Jenkins status badge (#4090)
* Warn users against using internal functions of `Booster` object (#4066)
* Reformat `benchmark_tree.py` to comply with Python style convention (#4126)
* Clarify a comment in `objectiveTrait` (#4174)
* Fix typos and broken links in documentation (#3890, #3872, #3902, #3919, #3975, #4027, #4156, #4167)
### Acknowledgement
**Contributors** (in no particular order): Jiaming Yuan (@trivialfis), Hyunsu Cho (@hcho3), Nan Zhu (@CodingCat), Rory Mitchell (@RAMitchell), Yanbo Liang (@yanboliang), Andy Adinets (@canonizer), Tong He (@hetong007), Yuan Tang (@terrytangyuan)
**First-time Contributors** (in no particular order): Jelle Zijlstra (@JelleZijlstra), Jiacheng Xu (@jiachengxu), @ajing, Kashif Rasul (@kashif), @theycallhimavi, Joey Gao (@pjgao), Prabakaran Kumaresshan (@nixphix), Huafeng Wang (@huafengw), @lyxthe, Sam Wilkinson (@scwilkinson), Tatsuhito Kato (@stabacov), Shayak Banerjee (@shayakbanerjee), Kodi Arfer (@Kodiologist), @KyleLi1985, Egor Smirnov (@SmirnovEgorRu), @tmitanitky, Pasha Stetsenko (@st-pasha), Kenichi Nagahara (@keni-chi), Abhai Kollara Dilip (@abhaikollara), Patrick Ford (@pford221), @hshujuan, Matthew Jones (@mt-jones), Thejaswi Rao (@teju85), Adam November (@anovember)
**First-time Reviewers** (in no particular order): Mingyang Hu (@mingyang), Theodore Vasiloudis (@thvasilo), Jakub Troszok (@troszok), Rong Ou (@rongou), @Denisevi4, Matthew Jones (@mt-jones), Jeff Kaplan (@jeffdk)
## v0.81 (2018.11.04)
### New feature: feature interaction constraints
* Users are now able to control which features (independent variables) are allowed to interact by specifying feature interaction constraints (#3466).
* [Tutorial](https://xgboost.readthedocs.io/en/release_0.81/tutorials/feature_interaction_constraint.html) is available, as well as [R](https://github.com/dmlc/xgboost/blob/9254c58e4dfff6a59dc0829a2ceb02e45ed17cd0/R-package/demo/interaction_constraints.R) and [Python](https://github.com/dmlc/xgboost/blob/9254c58e4dfff6a59dc0829a2ceb02e45ed17cd0/tests/python/test_interaction_constraints.py) examples.
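For illustration, a minimal Python sketch of specifying interaction constraints (synthetic data; the feature groups are hypothetical, see the tutorial above for details):

```python
import numpy as np
import xgboost as xgb

X, y = np.random.rand(100, 5), np.random.rand(100)
dtrain = xgb.DMatrix(X, label=y)

# Features 0 and 1 may interact with each other; features 2, 3 and 4 form a
# second permitted group. No interactions across groups are allowed.
params = {
    "tree_method": "hist",
    "interaction_constraints": "[[0, 1], [2, 3, 4]]",
}
bst = xgb.train(params, dtrain, num_boost_round=10)
```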
### New feature: learning to rank using scikit-learn interface
* Learning to rank task is now available for the scikit-learn interface of the Python package (#3560, #3848). It is now possible to integrate the XGBoost ranking model into the scikit-learn learning pipeline.
* An example of using the `XGBRanker` class can be found at [demo/rank/rank_sklearn.py](https://github.com/dmlc/xgboost/blob/24a268a2e3cb17302db3d72da8f04016b7d352d9/demo/rank/rank_sklearn.py).
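For illustration, a minimal `XGBRanker` sketch (synthetic data; two query groups of three documents each):

```python
import numpy as np
from xgboost import XGBRanker

X = np.random.rand(6, 4)
y = np.array([2, 1, 0, 1, 0, 2])  # graded relevance labels
group = [3, 3]                    # number of documents per query group

ranker = XGBRanker(objective="rank:pairwise", n_estimators=10)
ranker.fit(X, y, group=group)
scores = ranker.predict(X)        # higher score = ranked higher within a group
```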
### New feature: R interface for SHAP interactions
* SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. Previously, this feature was only available from the Python package; now it is available from the R package as well (#3636).
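For reference, the R interface mirrors the existing Python call sketched below (synthetic data); the R `predict` method exposes an analogous interaction argument:

```python
import numpy as np
import xgboost as xgb

dtrain = xgb.DMatrix(np.random.rand(50, 4), label=np.random.rand(50))
bst = xgb.train({"max_depth": 2}, dtrain, num_boost_round=5)

# One (n_features + 1) x (n_features + 1) matrix of SHAP interaction values
# per row; the extra row/column carries the bias terms.
shap_inter = bst.predict(dtrain, pred_interactions=True)
print(shap_inter.shape)  # (50, 5, 5)
```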
### New feature: GPU predictor now uses multiple GPUs to predict
* GPU predictor is now able to utilize multiple GPUs at once to accelerate prediction (#3738)
### New feature: Scale distributed XGBoost to large-scale clusters
* Fix OS file descriptor limit assertion error on large cluster (#3835, dmlc/rabit#73) by replacing `select()` based AllReduce/Broadcast with `poll()` based implementation.
* Mitigate the tracker "thundering herd" issue on large clusters: workers now retry connecting to the tracker with exponential backoff.
* With this change, we were able to scale to 1.5k executors on a 12-billion-row dataset after some additional tweaks.
### New feature: Additional objective functions for GPUs
* New objective functions ported to GPU: `hinge`, `multi:softmax`, `multi:softprob`, `count:poisson`, `reg:gamma`, `reg:tweedie`.
* With supported objectives, XGBoost will select the correct devices based on your system and `n_gpus` parameter.
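A minimal configuration sketch (synthetic data; assumes a CUDA-enabled build and uses the `n_gpus` parameter of this era):

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 5)
y = np.random.randint(0, 3, size=100)
dtrain = xgb.DMatrix(X, label=y)

# The GPU-ported objective is requested by its usual name; devices are
# selected based on the system and the n_gpus parameter.
params = {
    "objective": "multi:softprob",
    "num_class": 3,
    "tree_method": "gpu_hist",
    "n_gpus": 1,
}
bst = xgb.train(params, dtrain, num_boost_round=10)
```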
### Major bug fix: learning to rank with XGBoost4J-Spark
* Previously, `repartitionForData` would shuffle data and lose ordering necessary for ranking task.
* To fix this issue, data points within each RDD partition are now explicitly grouped by their query group (query session) IDs (#3654). Empty RDD partitions are also handled carefully (#3750).
### Major bug fix: early stopping fixed in XGBoost4J-Spark
* The earlier implementation of early stopping had incorrect semantics and didn't let users specify the direction of optimization (maximize / minimize)
* A new parameter `maximize_evaluation_metrics` tells whether a metric should be maximized or minimized as part of the early stopping criteria (#3808). Early stopping now has correct semantics.
### API changes
* Column sampling by level (`colsample_bylevel`) is now functional for `hist` algorithm (#3635, #3862)
* The GPU tag `gpu:` for regression objectives is now deprecated. XGBoost will select the correct devices automatically (#3643)
* Add `disable_default_eval_metric` parameter to disable default metric (#3606)
* Experimental AVX support for gradient computation is removed (#3752)
* XGBoost4J-Spark
- Add `rank:ndcg` and `rank:map` to supported objectives (#3697)
* Python package
- Add `callbacks` argument to `fit()` function of scikit-learn API (#3682)
- Add `XGBRanker` to scikit-learn interface (#3560, #3848)
- Add `validate_features` argument to `predict()` function of scikit-learn API (#3653)
- Allow scikit-learn grid search over parameters specified as keyword arguments (#3791)
- Add `coef_` and `intercept_` as properties of scikit-learn wrapper (#3855). Some scikit-learn functions expect these properties.
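For illustration, a minimal sketch of the improved scikit-learn integration (#3791): parameters passed as keyword arguments are now visible to grid search (synthetic data):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

X, y = np.random.rand(100, 5), np.random.rand(100)

# max_depth and learning_rate are plain constructor kwargs, yet remain
# searchable via the standard scikit-learn machinery.
search = GridSearchCV(
    XGBRegressor(n_estimators=10),
    param_grid={"max_depth": [2, 4], "learning_rate": [0.1, 0.3]},
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```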
### Performance improvements
* Address very high GPU memory usage for large data (#3635)
* Fix performance regression within `EvaluateSplits()` of `gpu_hist` algorithm. (#3680)
### Bug-fixes
* Fix a problem in GPU quantile sketch with tiny instance weights. (#3628)
* Fix copy constructor for `HostDeviceVectorImpl` to prevent dangling pointers (#3657)
* Fix a bug in partitioned file loading (#3673)
* Fix an uninitialized pointer in `gpu_hist` (#3703)
* Re-share data among GPUs when the number of GPUs is changed (#3721)
* Add back `max_delta_step` to split evaluation (#3668)
* Do not round up integer thresholds for integer features in JSON dump (#3717)
* Use `dmlc::TemporaryDirectory` to handle temporaries in cross-platform way (#3783)
* Fix accuracy problem with `gpu_hist` when `min_child_weight` and `lambda` are set to 0 (#3793)
* Make sure that `tree_method` parameter is recognized and not silently ignored (#3849)
* XGBoost4J-Spark
- Make sure `thresholds` are considered when executing `predict()` method (#3577)
- Avoid losing precision when computing probabilities by converting to `Double` early (#3576)
- `getTreeLimit()` should return `Int` (#3602)
- Fix checkpoint serialization on HDFS (#3614)
- Throw `ControlThrowable` instead of `InterruptedException` so that it is properly re-thrown (#3632)
- Remove extraneous output to stdout (#3665)
- Allow specification of task type for custom objectives and evaluations (#3646)
- Fix distributed updater check (#3739)
- Fix issue when spark job execution thread cannot return before we execute `first()` (#3758)
* Python package
- Fix accessing `DMatrix.handle` before it is set (#3599)
- `XGBClassifier.predict()` should return margin scores when `output_margin` is set to true (#3651)
- Early stopping callback should maximize metric of form `NDCG@n-` (#3685)
- Preserve feature names when slicing `DMatrix` (#3766)
* R package
- Replace `nround` with `nrounds` to match actual parameter (#3592)
- Amend `xgb.createFolds` to handle classes of a single element (#3630)
- Fix buggy random generator and make `colsample_bytree` functional (#3781)
### Maintenance: testing, continuous integration, build system
* Add sanitizers tests to Travis CI (#3557)
* Add NumPy, Matplotlib, Graphviz as requirements for doc build (#3669)
* Comply with CRAN submission policy (#3660, #3728)
* Remove copy-paste error in JVM test suite (#3692)
* Disable flaky tests in `R-package/tests/testthat/test_update.R` (#3723)
* Make Python tests compatible with scikit-learn 0.20 release (#3731)
* Separate out restricted and unrestricted tasks, so that pull requests don't build downloadable artifacts (#3736)
* Add multi-GPU unit test environment (#3741)
* Allow plug-ins to be built by CMake (#3752)
* Test wheel compatibility on CPU containers for pull requests (#3762)
* Fix broken doc build due to Matplotlib 3.0 release (#3764)
* Produce `xgboost.so` for XGBoost-R on Mac OSX, so that `make install` works (#3767)
* Retry Jenkins CI tests up to 3 times to improve reliability (#3769, #3775, #3776, #3777)
* Add basic unit tests for `gpu_hist` algorithm (#3785)
* Fix Python environment for distributed unit tests (#3806)
* Test wheels on CUDA 10.0 container for compatibility (#3838)
* Fix JVM doc build (#3853)
### Maintenance: Refactor C++ code for legibility and maintainability
* Merge generic device helper functions into `GPUSet` class (#3626)
* Re-factor column sampling logic into `ColumnSampler` class (#3635, #3637)
* Replace `std::vector` with `HostDeviceVector` in `MetaInfo` and `SparsePage` (#3446)
* Simplify `DMatrix` class (#3395)
* De-duplicate CPU/GPU code using `Transform` class (#3643, #3751)
* Remove obsoleted `QuantileHistMaker` class (#3761)
* Remove obsoleted `NoConstraint` class (#3792)
### Other Features
* C++20-compliant Span class for safe pointer indexing (#3548, #3588)
* Add helper functions to manipulate multiple GPU devices (#3693)
* XGBoost4J-Spark
- Allow specifying host ip from the `xgboost-tracker.properties` file (#3833). This comes in handy when the `hosts` file doesn't correctly define localhost.
### Usability Improvements
* Add reference to GitHub repository in `pom.xml` of JVM packages (#3589)
* Add R demo of multi-class classification (#3695)
* Document JSON dump functionality (#3600, #3603)
* Document CUDA requirement and lack of external memory for GPU algorithms (#3624)
* Document LambdaMART objectives, both pairwise and listwise (#3672)
* Document `aucpr` evaluation metric (#3687)
* Document gblinear parameters: `feature_selector` and `top_k` (#3780)
* Add instructions for using MinGW-built XGBoost with Python. (#3774)
* Removed nonexistent parameter `use_buffer` from documentation (#3610)
* Update Python API doc to include all classes and members (#3619, #3682)
* Fix typos and broken links in documentation (#3618, #3640, #3676, #3713, #3759, #3784, #3843, #3852)
* Binary classification demo should produce LIBSVM with 0-based indexing (#3652)
* Process data once for Python and CLI examples of learning to rank (#3666)
* Include full text of Apache 2.0 license in the repository (#3698)
* Save predictor parameters in model file (#3856)
* JVM packages
- Let users specify feature names when calling `getModelDump` and `getFeatureScore` (#3733)
- Warn the user about the lack of over-the-wire encryption (#3667)
- Fix errors in examples (#3719)
- Document choice of trackers (#3831)
- Document that vanilla Apache Spark is required (#3854)
* Python package
- Document that custom objective can't contain colon (:) (#3601)
- Show a better error message for failed library loading (#3690)
- Document that feature importance is unavailable for non-tree learners (#3765)
- Document behavior of `get_fscore()` for zero-importance features (#3763)
- Recommend pickling as the way to save `XGBClassifier` / `XGBRegressor` / `XGBRanker` (#3829); see the sketch at the end of this section
* R package
- Enlarge variable importance plot to make it more visible (#3820)
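As recommended in the Python package notes above, a minimal pickling sketch (synthetic data; file name hypothetical):

```python
import pickle
import numpy as np
from xgboost import XGBClassifier

X, y = np.random.rand(50, 3), np.random.randint(0, 2, size=50)
clf = XGBClassifier(n_estimators=10).fit(X, y)

with open("clf.pkl", "wb") as f:   # save the fitted estimator
    pickle.dump(clf, f)
with open("clf.pkl", "rb") as f:   # load it back
    clf_loaded = pickle.load(f)
```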
### BREAKING CHANGES
* External memory page files have changed, breaking backwards compatibility for temporary storage used during external memory training. This only affects external memory users upgrading their xgboost version - we recommend clearing all `*.page` files before resuming training. Model serialization is unaffected.
### Known issues
* Quantile sketcher fails to produce any quantile for some edge cases (#2943)
* The `hist` algorithm leaks memory when used with learning rate decay callback (#3579)
* Using a custom evaluation function together with early stopping causes an assertion failure in XGBoost4J-Spark (#3595)
* Early stopping doesn't work with `gblinear` learner (#3789)
* Label and weight vectors are not reshared upon the change in number of GPUs (#3794). To get around this issue, delete the `DMatrix` object and re-load.
* The `DMatrix` Python objects are initialized with incorrect values when given array slices (#3841)
* The `gpu_id` parameter is broken and not yet properly supported (#3850)
### Acknowledgement
**Contributors** (in no particular order): Hyunsu Cho (@hcho3), Jiaming Yuan (@trivialfis), Nan Zhu (@CodingCat), Rory Mitchell (@RAMitchell), Andy Adinets (@canonizer), Vadim Khotilovich (@khotilov), Sergei Lebedev (@superbobry)
**First-time Contributors** (in no particular order): Matthew Tovbin (@tovbinm), Jakob Richter (@jakob-r), Grace Lam (@grace-lam), Grant W Schneider (@grantschneider), Andrew Thia (@BlueTea88), Sergei Chipiga (@schipiga), Joseph Bradley (@jkbradley), Chen Qin (@chenqin), Jerry Lin (@linjer), Dmitriy Rybalko (@rdtft), Michael Mui (@mmui), Takahiro Kojima (@515hikaru), Bruce Zhao (@BruceZhaoR), Wei Tian (@weitian), Saumya Bhatnagar (@Sam1301), Juzer Shakir (@JuzerShakir), Zhao Hang (@cleghom), Jonathan Friedman (@jontonsoup), Bruno Tremblay (@meztez), Boris Filippov (@frenzykryger), @Shiki-H, @mrgutkun, @gorogm, @htgeis, @jakehoare, @zengxy, @KOLANICH
**First-time Reviewers** (in no particular order): Nikita Titov (@StrikerRUS), Xiangrui Meng (@mengxr), Nirmal Borah (@Nirmal-Neel)
## v0.80 (2018.08.13)
* **JVM packages received a major upgrade**: To consolidate the APIs and improve the user experience, we refactored the design of XGBoost4J-Spark in a significant manner. (#3387)
- Consolidated APIs: It is now much easier to integrate XGBoost models into a Spark ML pipeline. Users can control behaviors like output leaf prediction results by setting corresponding column names. Training is now more consistent with other Estimators in Spark MLlib: there is now one single method `fit()` to train decision trees.
- Better user experience: we refactored the parameter-related modules in XGBoost4J-Spark to provide both camel-case (Spark ML style) and underscore (XGBoost style) parameters
- A brand-new tutorial is [available](https://xgboost.readthedocs.io/en/release_0.80/jvm/xgboost4j_spark_tutorial.html) for XGBoost4J-Spark.
- Latest API documentation is now hosted at https://xgboost.readthedocs.io/.
* XGBoost documentation now keeps track of multiple versions.
* Support for per-group weights in ranking objective (#3379)
* Fix inaccurate decimal parsing (#3546)
* New functionality
- Query ID column support in LIBSVM data files (#2749). This is convenient for performing ranking tasks in a distributed setting; see the loading sketch at the end of this section.
- Hinge loss for binary classification (`binary:hinge`) (#3477)
- Ability to specify delimiter and instance weight column for CSV files (#3546)
- Ability to use 1-based indexing instead of 0-based (#3546)
* GPU support
- Quantile sketch, binning, and index compression are now performed on GPU, eliminating PCIe transfer for 'gpu_hist' algorithm (#3319, #3393)
- Upgrade to NCCL2 for multi-GPU training (#3404).
- Use shared memory atomics for faster training (#3384).
- Dynamically allocate GPU memory, to prevent large allocations for deep trees (#3519)
- Fix memory copy bug for large files (#3472)
* Python package
- Importing data from Python datatable (#3272)
- Pre-built binary wheels available for 64-bit Linux and Windows (#3424, #3443)
- Add new importance measures 'total_gain', 'total_cover' (#3498)
- Sklearn API now supports saving and loading models (#3192)
- Arbitrary cross validation fold indices (#3353)
- `predict()` function in Sklearn API uses `best_ntree_limit` if available, to make early stopping easier to use (#3445)
- Informational messages are now directed to Python's `print()` rather than standard output (#3438). This way, messages appear inside Jupyter notebooks.
* R package
- Oracle Solaris support, per CRAN policy (#3372)
* JVM packages
- Single-instance prediction (#3464)
- Pre-built JARs are now available from Maven Central (#3401)
- Add NULL pointer check (#3021)
- Consider `spark.task.cpus` when controlling parallelism (#3530)
- Handle missing values in prediction (#3529)
- Eliminate outputs of `System.out` (#3572)
* Refactored C++ DMatrix class for simplicity and de-duplication (#3301)
* Refactored C++ histogram facilities (#3564)
* Refactored constraints / regularization mechanism for split finding (#3335, #3429). Users may specify an elastic net (L2 + L1 regularization) on leaf weights as well as monotonic constraints on test nodes. The refactor will be useful for a future addition of feature interaction constraints.
* Statically link `libstdc++` for MinGW32 (#3430)
* Enable loading from `group`, `base_margin` and `weight` (see [here](http://xgboost.readthedocs.io/en/latest/tutorials/input_format.html#auxiliary-files-for-additional-information)) for Python, R, and JVM packages (#3431)
* Fix model saving for `count:poisson` so that `max_delta_step` doesn't get truncated (#3515)
* Fix loading of sparse CSC matrix (#3553)
* Fix incorrect handling of `base_score` parameter for Tweedie regression (#3295)
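As an illustration of the new data loading options above, a minimal sketch (file names and contents hypothetical):

```python
import xgboost as xgb

# LIBSVM file carrying query IDs, e.g. lines of the form:
#   1 qid:1 1:0.4 2:0.9
#   0 qid:1 1:0.1 2:0.2
#   1 qid:2 1:0.7 2:0.3
dtrain_rank = xgb.DMatrix("train.libsvm")

# CSV file with the label in column 0, loaded via URI parameters
dtrain_csv = xgb.DMatrix("train.csv?format=csv&label_column=0")
```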
## v0.72.1 (2018.07.08)
This version is only applicable for the Python package. The content is identical to that of v0.72.
## v0.72 (2018.06.01)
* Starting with this release, we plan to make a new release every two months. See #3252 for more details.
* Fix a pathological behavior (near-zero second-order gradients) in multiclass objective (#3304)
* Tree dumps now use high precision in storing floating-point values (#3298)
* Submodules `rabit` and `dmlc-core` have been brought up to date, bringing bug fixes (#3330, #3221).
* GPU support
- Continuous integration tests for GPU code (#3294, #3309)
- Abstract 1D vector class now works with multiple GPUs (#3287)
- Generate PTX code for most recent architecture (#3316)
- Fix a memory bug on NVIDIA K80 cards (#3293)
- Address performance instability for single-GPU, multi-core machines (#3324)
* Python package
- FreeBSD support (#3247)
- Validation of feature names in `Booster.predict()` is now optional (#3323)
* Updated Sklearn API
- Validation sets now support instance weights (#2354)
- `XGBClassifier.predict_proba()` should not support `output_margin` option. (#3343) See BREAKING CHANGES below.
* R package:
- Better handling of NULL in `print.xgb.Booster()` (#3338)
- Comply with CRAN policy by removing compiler warning suppression (#3329)
- Updated CRAN submission
* JVM packages
- JVM packages will now use the same versioning scheme as other packages (#3253)
- Update Spark to 2.3 (#3254)
- Add scripts to cross-build and deploy artifacts (#3276, #3307)
- Fix a compilation error for Scala 2.10 (#3332)
* BREAKING CHANGES
- `XGBClassifier.predict_proba()` no longer accepts parameter `output_margin`. The parameter makes no sense for `predict_proba()` because the method is meant to predict class probabilities, not raw margin scores.
## v0.71 (2018.04.11)
* This is a minor release, mainly motivated by issues concerning `pip install`, e.g. #2426, #3189, #3118, and #3194.
With this release, users of Linux and macOS will be able to run `pip install` for the most part.
#' \item \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
#' \item \code{max_depth} maximum depth of a tree. Default: 6
#' \item \code{min_child_weight} minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to the minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
#' \item \code{subsample} subsample ratio of the training instances. Setting it to 0.5 means that xgboost randomly collects half of the data instances to grow trees, which prevents overfitting. It also makes computation shorter (because there is less data to analyse). It is advised to use this parameter together with \code{eta}, increasing \code{nrounds} accordingly. Default: 1
#' \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
#' \item \code{num_parallel_tree} Experimental parameter. Number of trees to grow per round. Useful for testing Random Forest through XGBoost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{round = 1} accordingly). Default: 1
#' \item \code{monotone_constraints} A numerical vector consisting of \code{1}, \code{0} and \code{-1}, with its length equal to the number of features in the training data. \code{1} is increasing, \code{-1} is decreasing and \code{0} is no constraint.
#' \item \code{interaction_constraints} A list of vectors specifying feature indices of permitted interactions. Each item of the list represents one permitted interaction where specified features are allowed to interact with each other. Feature index values should start from \code{0} (\code{0} references the first column). Leave the argument unspecified for no interaction constraints.
#' }
#'
#' 2.2. Parameter for Linear Booster
#' \itemize{
#' \item \code{objective} specify the learning task and the corresponding learning objective; users can pass a self-defined function to it. The default objective options are below:
#' \itemize{
#' \item \code{reg:squarederror} Regression with squared loss (Default).
#' \item \code{reg:logistic} logistic regression.
#' \item \code{binary:logistic} logistic regression for binary classification. Output probability.
#' \item \code{binary:logitraw} logistic regression for binary classification, output score before logistic transformation.
#' the performance of each round's model on mat1 and mat2.
#' @param obj customized objective function. Returns gradient and second order
Next step, we will transform the categorical data to dummy variables.
Several encoding methods exist, e.g., [one-hot encoding](http://en.wikipedia.org/wiki/One-hot) is a common approach.
We will use the [dummy contrast coding](http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm#dummy) which is popular because it produces "full rank" encoding (also see [this blog post by Max Kuhn](http://appliedpredictivemodeling.com/blog/2013/10/23/the-basics-of-encoding-categorical-data-for-predictive-models)).
The purpose is to transform each value of each *categorical* feature into a *binary* feature `{0, 1}`.
In this specific case, *linear boosting* gets slightly better performance metrics than a decision-tree-based algorithm.
In simple cases, this happens because there is nothing better than a linear algorithm to catch a linear link. However, decision trees are much better at catching a non-linear link between predictors and outcome. Because there is no silver bullet, we advise you to check both algorithms with your own datasets to get an idea of what to use.
The purpose of this Vignette is to show you how to correctly load and work with an **Xgboost** model that has been dumped to JSON. **Xgboost** internally converts all data to [32-bit floats](https://en.wikipedia.org/wiki/Single-precision_floating-point_format), and the values dumped to JSON are decimal representations of these values. When working with a model that has been parsed from a JSON file, care must be taken to correctly treat:
- the input data, which should be converted to 32-bit floats
- any 32-bit floats that were stored in JSON as decimal representations
- any calculations must be done with 32-bit mathematical operators
## Setup
For the purpose of this tutorial we will load the xgboost, jsonlite, and float packages. We'll also set `digits=22` in our options in case we want to inspect many digits of our results.
```{r}
require(xgboost)
require(jsonlite)
require(float)
options(digits=22)
```
We will create a toy binary logistic model based on the example first provided [here](https://github.com/dmlc/xgboost/issues/3960), so that we can easily understand the structure of the dumped JSON model object. This will allow us to understand where discrepancies can occur and how they should be handled.
```{r}
dates <- c(20180130, 20180130, 20180130,
20180130, 20180130, 20180130,
20180131, 20180131, 20180131,
20180131, 20180131, 20180131,
20180131, 20180131, 20180131,
20180134, 20180134, 20180134)
labels <- c(1, 1, 1,
1, 1, 1,
0, 0, 0,
0, 0, 0,
0, 0, 0,
0, 0, 0)
data <- data.frame(dates = dates, labels=labels)
bst <- xgboost(
data = as.matrix(data$dates),
label = labels,
nthread = 2,
nrounds = 1,
objective = "binary:logistic",
missing = NA,
max_depth = 1
)
```
## Comparing results
We will now dump the model to JSON and attempt to illustrate a variety of issues that can arise, and how to properly deal with them.
The tree JSON shown by the above code-chunk tells us that if the data is less than 20180132, the tree will output the value in the first leaf. Otherwise it will output the value in the second leaf. Let's try to reproduce this manually with the data we have and confirm that it matches the model predictions we've already calculated.
If we round to two decimals, we see that only the elements related to data values of `20180131` don't agree. If we convert the data to floats, they agree.
What's the lesson? If we are going to work with an imported JSON model, any data must be converted to floats first. In this case, since '20180131' cannot be represented as a 32-bit float, it is rounded up to 20180132, as shown here:
```{r}
fl(20180131)
```
### Lesson 2: JSON parameters are 32-bit floats
> All JSON parameters stored as floats must be converted to floats.
Let's now say we do care about numbers past the first two decimals.
```{r}
# test that values are equal
bst_preds_logodds == bst_from_json_logodds
```
None are exactly equal. What happened? Although we've converted the data to 32-bit floats, we also need to convert the JSON parameters to 32-bit floats. Let's do this:
All equal. What's the lesson? If we are going to work with an imported JSON model, any JSON parameters that were stored as floats must also be converted to floats first.
### Lesson 3: Use 32-bit math
> Always use 32-bit numbers and operators
We were able to get the log-odds to agree, so now let's manually calculate the sigmoid of the log-odds. This should agree with the xgboost predictions.
```{r}
bst_preds <- predict(bst, as.matrix(data$dates))
# calculate the predictions casting doubles to floats; the literal 1 below is
# a double (a sketch, assuming bst_from_json_logodds holds the log-odds
# recovered from the JSON dump)
bst_from_json_preds <- 1 / (1 + exp(-bst_from_json_logodds))
# test that values are equal
bst_preds == bst_from_json_preds
```
None are exactly equal again. What is going on here? Since we are using the value `1` in the calculations, we have introduced a double into the calculation. Because of this, all float values are promoted to 64-bit doubles and the 64-bit version of the exponentiation operator `exp` is also used. On the other hand, xgboost uses the 32-bit version of the exponentiation operator in its [sigmoid function](https://github.com/dmlc/xgboost/blob/54980b8959680a0da06a3fc0ec776e47c8cbb0a1/src/common/math.h#L25-L27).
How do we fix this? We have to ensure we use the correct data types and the correct operators everywhere. If we use only floats, the float library that we have loaded will ensure the 32-bit float exponentiation operator is applied.
```{r}
# calculate the predictions using only floats (a sketch, with the same
# assumed variables as above)
bst_from_json_preds <- fl(1) / (fl(1) + exp(-fl(bst_from_json_logodds)))
# test that values are equal
bst_preds == bst_from_json_preds
```
All equal. What's the lesson? We have to ensure that all calculations are done with 32-bit floating point operators if we want to reproduce the results that we see with xgboost.