- Update SparseDMatrix comment.
- Use a pointer in the bitfield. We will replace the `std::vector<bool>` in `ColumnMatrix` with bitfield.
- Clean up the page source. The timer is removed as it's inaccurate once we swap the mmap pointer into the page.
- Rework the precision metric for both CPU and GPU.
- Mention it in the document.
- Cleanup old support code for GPU ranking metric.
- Deterministic GPU implementation.
* Drop support for classification.
* type.
* use batch shape.
* lint.
* cpu build.
* cpu build.
* lint.
* Tests.
* Fix.
* Cleanup error message.
- Implement a simple `IterSpan` for passing iterators with size.
- Use shared memory for column size counts.
- Use one thread for each sample in row count to reduce atomic operations.
* [CI] Update images that are not related to the binary release.
- Update clang-tidy, prefer tools from the Ubuntu repository.
- Update GPU image to 22.04.
- Small cleanup to the tidy script.
- Remove gpu_jvm, which seems to be unused.
Thrust implementation of `thrust::all_of/any_of/none_of` adopts an early stopping strategy
to bailout early by dividing the input into small batches. This is not ideal for data
validation as we expect all data to be valid. The strategy leads to excessive kernel
launches and stream synchronization.
* Use reduce from dh instead.
- Pass context from booster to DMatrix.
- Use context instead of integer for `n_threads`.
- Check the consistency configuration for `max_bin`.
- Test for all combinations of initialization options.
Previously, we use `libsvm` as default when format is not specified. However, the dmlc
data parser is not particularly robust against errors, and the most common type of error
is undefined format.
Along with which, we will recommend users to use other data loader instead. We will
continue the maintenance of the parsers as it's currently used for many internal tests
including federated learning.
* Create pyproject.toml
* Implement a custom build backend (see below) in packager directory. Build logic from setup.py has been refactored and migrated into the new backend.
* Tested: pip wheel . (build wheel), python -m build --sdist . (source distribution)
* Fix tests with pandas 2.0.
- `is_categorical` is replaced by `is_categorical_dtype`.
- one hot encoding returns boolean type instead of integer type.
* [dask] Return the first valid booster instead of all valid ones.
- Reduce memory footprint of the returned model.
* mypy error.
* lint.
* duplicated.
Added some more tests for the learner and fit_stump, for both column-wise distributed learning and vertical federated learning.
Also moved the `IsRowSplit` and `IsColumnSplit` methods from the `DMatrix` to the `MetaInfo` since in some places we only have access to the `MetaInfo`. Added a new convenience method `IsVerticalFederatedLearning`.
Some refactoring of the testing fixtures.
- Fix prediction range.
- Support prediction cache in mt-hist.
- Support model slicing.
- Make the booster a Python iterable by defining `__iter__`.
- Cleanup removed/deprecated parameters.
- A new field in the output model `iteration_indptr` for pointing to the ranges of trees for each iteration.
- Remove parameter serialization in the scikit-learn interface.
The scikit-lear interface `save_model` will save only the model and discard all
hyper-parameters. This is to align with the native XGBoost interface, which distinguishes
the hyper-parameter and model parameters.
With the scikit-learn interface, model parameters are attributes of the estimator. For
instance, `n_features_in_`, `n_classes_` are always accessible with
`estimator.n_features_in_` and `estimator.n_classes_`, but not with the
`estimator.get_params`.
- Define a `load_model` method for classifier to load its own attributes.
- Set n_estimators to None by default.
* Implement multi-target for hist.
- Add new hist tree builder.
- Move data fetchers for tests.
- Dispatch function calls in gbm base on the tree type.
- The new implementation is more strict as only binary labels are accepted. The previous implementation converts values greater than 1 to 1.
- Deterministic GPU. (no atomic add).
- Fix top-k handling.
- Precise definition of MAP. (There are other variants on how to handle top-k).
- Refactor GPU ranking tests.
- Extract the builder from the updater class. We need a new builder for multi-target.
- Extract `UpdateTree`, it can be reused for different builders. Eventually, other tree
updaters can use it as well.
* Make tree model param a private member.
* Number of features and targets are immutable after construction.
This is to reduce the number of places where we can run configuration.
- Pass obj info into tree updater as const pointer.
This way we don't have to initialize the learner model param before configuring gbm, hence
breaking up the dependency of configurations.
- Define a new tree struct embedded in the `RegTree`.
- Provide dispatching functions in `RegTree`.
- Fix some c++-17 warnings about the use of nodiscard (currently we disable the warning on
the CI).
- Use uint32_t instead of size_t for `bst_target_t` as it has a defined size and can be used
as part of dmlc parameter.
- Hide the `Segment` struct inside the categorical split matrix.
* Support sklearn cross validation for ranker.
- Add a convention for X to include a special `qid` column.
sklearn utilities consider only `X`, `y` and `sample_weight` for supervised learning
algorithms, but we need an additional qid array for ranking.
It's important to be able to support the cross validation function in sklearn since all
other tuning functions like grid search are based on cross validation.
* Update to C++17
* Turn off unity build
* Update CMake to 3.18
* Use MSVC 2022 + CUDA 11.8
* Re-create stack for worker images
* Allocate more disk space for Windows
* Tempiorarily disable clang-tidy
* RAPIDS now requires Python 3.10+
* Unpin cuda-python
* Use latest NCCL
* Use Ubuntu 20.04 in RMM image
* Mark failing mgpu test as xfail
* Fix CPU bin compression with categorical data.
* The bug causes the maximum category to be lesser than 256 or the maximum number of bins when
the input data is dense.
* Extract most of the functionality into `DMatrixCache`.
* Move API entry to independent file to reduce dependency on `predictor.h` file.
* Add test.
---------
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* Fix quantile tests running on multi-gpus
* Run some gtests with multiple GPUs
* fix mgpu test naming
* Instruct NCCL to print extra logs
* Allocate extra space in /dev/shm to enable NCCL
* use gtest_skip to skip mgpu tests
---------
Co-authored-by: Hyunsu Philip Cho <chohyu01@cs.washington.edu>
* Use array interface for CSC matrix.
Use array interface for CSC matrix and align the interface with CSR and dense.
- Fix nthread issue in the R package DMatrix.
- Unify the behavior of handling `missing` with other inputs.
- Unify the behavior of handling `missing` around R, Python, Java, and Scala DMatrix.
- Expose `num_non_missing` to the JVM interface.
- Deprecate old CSR and CSC constructors.
* Define default ctors for gpair.
Fix clang warning:
Definition of implicit copy assignment operator for 'GradientPairInternal<float>' is
deprecated because it has a user-declared copy constructor
* [jvm-packages] Bump rapids version to 22.12.0
This PR bumps spark version to 3.1.1 and the rapids version
to 22.12.0, which results in the latest xgboost can't run
with the old rapids packages.
We can handle loading the pickle on a CPU-only machine if the XGBoost is built with CUDA
enabled (Linux and Windows PyPI package), but not if the distribution is CPU-only (macOS
PyPI package).
* [CI] Fix CI with updated dependencies.
- Fix jvm package get iris.
* Skip SHAP test for now.
* Revert "Skip SHAP test for now."
This reverts commit 9aa28b4d8aee53fa95d92d2a879c6783ff4b2faa.
* Catch all exceptions.
* Support null value in CUDA array interface.
- Fix for potential null value in array interface.
- Fix incorrect check on mask stride.
* Simple tests.
* Extract mask.
- Replace jvm regex replacement script with mvn command.
- Replace cmake script for python version with python script.
- Automate rest of the manual steps.
The script can handle dev branch, rc release, and formal release version.
* [R] Use new interface for creating DMatrix from CSR.
- CSC is still using the old API.
The old API is not aware of `nthread` parameter, which makes DMatrix to use all available
thread during construction and during transformation lie `SparsePage` -> `CSCPage`.
* Thrust 1.17 removes the experimental/pinned_allocator.
When xgboost is brought into a large project it can
be compiled against Thrust 1.17+ which don't offer
this experimental allocator.
To ensure that going forward xgboost works in all environments we provide a xgboost namespaced version of
the pinned_allocator that previously was in Thrust.
- Use the standard package check (check on the tarball instead of the source tree).
- Run commands in parallel.
- Cleanup dependencies installation.
- Replace makefile.
- Documentation.
- Test using the image from rhub.
- Use rst references instead of doxygen links.
- Replace deprecated functions.
- Add SaveModel; put free step last [skip ci]
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
* Add back xgboost.rabit for backwards compatibility
* fix my errors
* Fix lint
* Use FutureWarning
Co-authored-by: Hyunsu Philip Cho <chohyu01@cs.washington.edu>
- Bump configure.ac version.
- Remove amalgamation to reduce the build time for a single object with the added benefit that we can use parallel build during development.
- Fix c function prototype warning.
- Remove Windows automake file generation step to make the build script easier to understand.
* Configuration for init estimation.
* Check whether the model needs configuration based on const attribute `ModelFitted`
instead of a mutable state.
* Add parameter `boost_from_average` to tell whether the user has specified base score.
* Add tests.
* Reduce clutter in log of Python test
* Set up BuildKite test analytics
* Add separate step for building containers
* Enable incremental update of CI stack; custom agent IAM policy
* Intoducing Column Wise Hist Building
* linting
* more linting
* bug fixing
* Removing column samping optimization for a while to simplify the review process.
* linting
* Removing unnecessary changes
* Use DispatchBinType in hist_util.cc
* Adding force_read_by column flag to buildhist. Adding tests for column wise buiilhist.
* Introducing new dispatcher for compile time flags in hist building
* fixing bug with using of DispatchBinType
* Fixing building
* Merging with master branch
Co-authored-by: dmitry.razdoburdin <drazdobu@jfldaal005.jf.intel.com>
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
- Group C API.
- Add C API sphinx doc.
- Consistent use of `OptionalArg` and the parameter name `config`.
- Remove call to deprecated functions in demo.
- Fix some formatting errors.
- Add links to c examples in the document (only visible with doxygen pages)
- Fix arrow.
* [CI] Use RAPIDS 22.10
* Store CUDA and RAPIDS versions in one place
* Fix
* Add missing #include
* Update gputreeshap submodule
* Fix
* Remove outdated distributed tests
* [jvm-packages] fix spark-rapids compatibility issue
spark-rapids (from 22.10) has shimmed GpuColumnVector, which means
we can't call it directly. So this PR call the UnshimmedGpuColumnVector
* Prepare for improving Windows networking compatibility.
* Include dmlc filesystem indirectly as dmlc/filesystem.h includes windows.h, which
conflicts with winsock2.h
* Define `NOMINMAX` conditionally.
* Link the winsock library when mysys32 is used.
* Add config file for read the doc.
- Use numpy stack for handling list of arrays.
- Reuse concat function from dask.
- Prepare for `QuantileDMatrix`.
- Remove unused code.
- Use iterator for prediction to avoid initializing xgboost model
There is a small typo in src/common/partition_builder.h.
Should read `canonical` rather than `cannonical`.
Signed-off-by: Tim Gates <tim.gates@iress.com>
* [Python] Require black and isort for new Python files.
- Require black and isort for spark and dask module.
These files are relatively new and are more conform to the black formatter. We will
convert the rest of the library as we move forward.
Other libraries including dask/distributed and optuna use the same formatting style and
have a more strict standard. The black formatter is indeed quite nice, automating it can
help us unify the code style.
- Gather Python checks into a single script.
* [PySpark] change the returning model type to string from binary
XGBoost pyspark can be can be accelerated by RAPIDS Accelerator seamlessly by
changing the returning model type from binary to string.
- Use `bst_bin_t` in batch param constructor.
- Use `StringView` to avoid `std::string` when appropriate.
- Avoid using `MetaInfo` in quantile constructor to limit the scope of parameter.
* Split up column matrix initialization.
This PR splits the column matrix initialization into 2 steps, the first one initializes
the storage while the second one does the transpose. By doing so, we can reuse the code
for Quantile DMatrix.
* Fix mypy error with latest dask.
Dask is adding type hints to its codebase and as the result, checks in XGBoost can be
performed more rigorously.
- Remove compatibility with old dask version where multi lock was missing.
- Restrict input of `X` to be non-series.
- Adopt latest definition of `Delayed`.
- Avoid passing optional `host_ip`.
- Avoid deprecated `worker.nthreads`.
* [jvm-packages] fix executor crashing issue when transforming on xgboost4j-spark-gpu
the API XGBoosterSetParam is not thread-safe. Dring the phase of transforming,
XGBoost runs several transforming tasks at a time, and each of them will set
the "gpu_id" and "predictor" parameters, so if several tasks (multi-threads)
all XGBoosterSetParam simultaneously, it may cause the memory to be corrupted
and cause SIGSEGV.
This PR first get the booster from broadcast and set to the correct gpu_id
and predictor, and then all transforming taskes will use the same booster to
do the transforming.
- Remove unused parameters. There are still many warnings that are not yet
addressed. Currently, the warnings in dmlc-core dominate the error log.
- Remove `distributed` parameter from metric.
- Fixes some warnings about signed comparison.
- Optionally switch to c++17
- Use rmm CMake target.
- Workaround compiler errors.
- Fix GPUMetric inheritance.
- Run death tests even if it's built with RMM support.
Co-authored-by: jakirkham <jakirkham@gmail.com>
This PR removes auto-detection of MUSL-based Linux systems in favor of system properties the user can set to configure a specific path for a native library.
* Pass sparse page as adapter, which prepares for quantile dmatrix.
* Remove old external memory code like `rbegin` and extra `Init` function.
* Simplify type dispatch.
Federated learning plugin for xgboost:
* A gRPC server to aggregate MPI-style requests (allgather, allreduce, broadcast) from federated workers.
* A Rabit engine for the federated environment.
* Integration test to simulate federated learning.
Additional followups are needed to address GPU support, better security, and privacy, etc.
Support adaptive tree, a feature supported by both sklearn and lightgbm. The tree leaf is recomputed based on residue of labels and predictions after construction.
For l1 error, the optimal value is the median (50 percentile).
This is marked as experimental support for the following reasons:
- The value is not well defined for distributed training, where we might have empty leaves for local workers. Right now I just use the original leaf value for computing the average with other workers, which might cause significant errors.
- Some follow-ups are required, for exact, pruner, and optimization for quantile function. Also, we need to calculate the initial estimation.
With the introduction of the barrier execution mode. we don't need to kill SparkContext when some xgboost tasks failed. Instead, Spark will handle the errors for us. So in this PR, `killSparkContextOnWorkerFailure` parameter is deleted.
* Make sure the task is initialized before construction of tree updater.
This is a quick fix meant to be backported to 1.6, for a full fix we should pass the model
param into tree updater by reference instead.
* Skip non-increasing test with external memory when subsample is used.
* Increase bin numbers for boost from prediction test. This mitigates the effect of
non-deterministic partitioning.
* Use the name `Context`.
* Pass a context object into `SetInfo`.
* Add context to proxy matrix.
* Add context to iterative DMatrix.
This is to remove the use of the default number of threads during `SetInfo` as a follow-up on
removing the global omp variable while preparing for CUDA stream semantic. Currently, XGBoost
uses the legacy CUDA stream, we will gradually remove them in the future in favor of non-blocking streams.
* Generate column matrix from gHistIndex.
* Avoid synchronization with the sparse page once the cache is written.
* Cleanups: Remove member variables/functions, change the update routine to look like approx and gpu_hist.
* Remove pruner.
Fix some tests to run in a temporary directory in case the root
directory is not writable. Note that most of tests are already
running in the temporary directory, so this PR just make them
consistent.
* Extract partitioner from hist.
* Implement categorical data support by passing the gradient index directly into the partitioner.
* Organize/update document.
* Remove code for negative hessian.
xgboost4j-spark provides 2 sets of API for setting features, one for CPU, another for GPU, which may cause confusion.
This PR removes the GPU API and adds an override CPU function setFeaturesCol to accept Array[String] parameters.
* Fix copy for cv. This prevents inserting default callbacks into the input list.
* Clarify the behavior of callbacks in training/cv.
* Fix typos in doc.
* Cleanup some pylint errors.
* Cleanup pylint errors in rabit modules.
* Make data iter an abstract class and cleanup private access.
* Cleanup no-self-use for booster.
- Mention standard install command for R package.
- Remove repeated "get source" step.
- Remove troubleshooting on Windows. It's outdated considering VS 2022 is already out.
* Implement `MaxCategory` in quantile.
* Implement partition-based split for GPU evaluation. Currently, it's based on the existing evaluation function.
* Extract an evaluator from GPU Hist to store the needed states.
* Added some CUDA stream/event utilities.
* Update document with references.
* Fixed a bug in approx evaluator where the number of data points is less than the number of categories.
Empty partition is different from empty dataset. For the former case, each worker has
non-empty dask collections, but each collection might contain empty partition.
This PR prepares the GHistIndexMatrix to host the column matrix which is used by the hist tree method by accepting sparse_threshold parameter.
Some cleanups are made to ensure the correct batch param is being passed into DMatrix along with some additional tests for correctness of SimpleDMatrix.
* Add a new utility for mapping function onto workers.
* Unify the type for feature names.
* Clean up the iterator.
* Fix prediction with DaskDMatrix worker specification.
* Fix base margin with DeviceQuantileDMatrix.
* Support vs 2022 in setup.py.
* Replace all uses of deprecated function sklearn.datasets.load_boston
* More renaming
* Fix bad name
* Update assertion
* Fix n boosted rounds.
* Avoid over regularization.
* Rebase.
* Avoid over regularization.
* Whac-a-mole
Co-authored-by: fis <jm.yuan@outlook.com>
This is the one last PR for removing omp global variable.
* Add context object to the `DMatrix`. This bridges `DMatrix` with https://github.com/dmlc/xgboost/issues/7308 .
* Require context to be available at the construction time of booster.
* Add `n_threads` support for R csc DMatrix constructor.
* Remove `omp_get_max_threads` in R glue code.
* Remove threading utilities that rely on omp global variable.
- Add user configuration.
- Bring back to the logic of using scheduler address from dask. This was removed when we were trying to support GKE, now we bring it back and let xgboost try it if direct guess or host IP from user config failed.
Note that when cub inside CUDA is being used, XGBoost performs checks on input size
instead of using internal cub function to accept inputs larger than maximum integer.
* Implement ubjson.
This is a partial implementation of UBJSON with support for typed arrays. Some missing
features are `f64`, typed object, and the no-op.
This PR rewrites the approx tree method to use codebase from hist for better performance and code sharing.
The rewrite has many benefits:
- Support for both `max_leaves` and `max_depth`.
- Support for `grow_policy`.
- Support for mono constraint.
- Support for feature weights.
- Support for easier bin configuration (`max_bin`).
- Support for categorical data.
- Faster performance for most of the datasets. (many times faster)
- Support for prediction cache.
- Significantly better performance for external memory.
- Unites the code base between approx and hist.
Instead of accessing data from the `original_page_`, access the data from the first page of the available batch.
fix#7476
Co-authored-by: jiamingy <jm.yuan@outlook.com>
* Add num target model parameter, which is configured from input labels.
* Change elementwise metric and indexing for weights.
* Add demo.
* Add tests.
* Add a new ctor to tensor for `initilizer_list`.
* Change labels from host device vector to tensor.
* Rename the field from `labels_` to `labels` since it's a public member.
This PR changes base_margin into a 3-dim array, with one of them being reserved for multi-target classification. Also, a breaking change is made for binary serialization due to extra dimension along with a fix for saving the feature weights. Lastly, it unifies the prediction initialization between CPU and GPU. After this PR, the meta info setter in Python will be based on array interface.
* [CI] Drop CUDA 10.1; Require 11.0
* Change NCCL version
* Use CUDA 10.1 for clang-tidy, for now
* Remove JDK 11 and 12
* Fix NCCL version
* Don't require 11.0 just yet, until clang-tidy is fixed
* Skip MultiClassesSerializationTest.GpuHist
* Extend array interface to handle ndarray.
The `ArrayInterface` class is extended to support multi-dim array inputs. Previously this
class handles only 2-dim (vector is also matrix). This PR specifies the expected
dimension at compile-time and the array interface can perform various checks automatically
for input data. Also, adapters like CSR are more rigorous about their input. Lastly, row
vector and column vector are handled without intervention from the caller.
* [R] Fix global feature importance.
* Add implementation for tree index. The parameter is not documented in C API since we
should work on porting the model slicing to R instead of supporting more use of tree
index.
* Fix the difference between "gain" and "total_gain".
* debug.
* Fix prediction.
Change from system Python to environment python3. For Ubuntu 20.04, only `python3` is
available and there's no `python`. So at least `python3` is consistent with Python
virtual env, Ubuntu and anaconda.
This is already partially supported but never properly tested. So the only possible way to use it is calling `numpy.ndarray.flatten` with `base_margin` before passing it into XGBoost. This PR adds proper support
for most of the data types along with tests.
Generated using `clang-format -style=google -dump-config > .clang-format`, with column
width changed from 80 to 100 to be consistent with existing cpplint check.
Spark 3.2 depends on 3.7.0-M11 which has changed some implicited functions'
signatures. And it will result the xgboost4j built against spark 3.0/3.1
failed when saving the model.
A new parameter `custom_metric` is added to `train` and `cv` to distinguish the behaviour from the old `feval`. And `feval` is deprecated. The new `custom_metric` receives transformed prediction when the built-in objective is used. This enables XGBoost to use cost functions from other libraries like scikit-learn directly without going through the definition of the link function.
`eval_metric` and `early_stopping_rounds` in sklearn interface are moved from `fit` to `__init__` and is now saved as part of the scikit-learn model. The old ones in `fit` function are now deprecated. The new `eval_metric` in `__init__` has the same new behaviour as `custom_metric`.
Added more detailed documents for the behaviour of custom objective and metric.
Following classes are added to support dataframe in java binding:
- `Column` is an abstract type for a single column in tabular data.
- `ColumnBatch` is an abstract type for dataframe.
- `CuDFColumn` is an implementaiton of `Column` that consume cuDF column
- `CudfColumnBatch` is an implementation of `ColumnBatch` that consumes cuDF dataframe.
- `DeviceQuantileDMatrix` is the interface for quantized data.
The Java implementation mimics the Python interface and uses `__cuda_array_interface__` protocol for memory indexing. One difference is on JVM package, the data batch is staged on the host as java iterators cannot be reset.
Co-authored-by: jiamingy <jm.yuan@outlook.com>
* Support more input types for categorical data.
* Shorten the type name from "categorical" to "c".
* Tests for np/cp array and scipy csr/csc/coo.
* Specify the type for feature info.
* Add hessian to batch param in preparation of new approx impl.
* Extract a push method for gradient index matrix.
* Use span instead of vector ref for hessian in sketching.
* Create a binary format for gradient index.
On GPU we use rouding factor to truncate the gradient for deterministic results. This PR changes the gradient representation to fixed point number with exponent aligned with rounding factor.
[breaking] Drop non-deterministic histogram.
Use fixed point for shared memory.
This PR is to improve the performance of GPU Hist.
Co-authored-by: Andy Adinets <aadinets@nvidia.com>
* [CI] Automatically build GPU-enabled R package for Windows
* Update Jenkinsfile-win64
* Build R package for the release branch only
* Update install doc
Fix bug introduced in 17913713b5 (allow loading from byte array)
When loading model from stream, only last buffer read from the input stream is used to construct the model.
This may work for models smaller than 1 MiB (if you are lucky enough to read the whole model at once), but will always fail if the model is larger.
* Work around a segfault observed in SparsePage::Push()
* Revert "Work around a segfault observed in SparsePage::Push()"
This reverts commit 30934844d00908750a5442082eb4769b1489f6a9.
* Don't call vector::resize() inside OpenMP block
* Set GITHUB_PAT env var to fix R tests
* Use built-in GITHUB_TOKEN
* Disallow importing non-dask estimators from xgboost.dask
This is mostly a style change, but also avoids a user error (that I have
committed on a few occasions). Since `XGBRegressor` and `XGBClassifier`
are imported as parent classes for the `dask` estimators, without
defining an `__all__`, autocomplete (or muscle) memory will produce the
following with little prompting:
```
from xgboost.dask import XGBClassifier
```
There's nothing inherently wrong with that, but given that
`XGBClassifier` is not `dask` enabled, it can lead to confusing behavior
until you figure out you should've typed
```
from xgboost.dask import DaskXGBClassifier
```
Another option is to alias import the existing non-dask estimators.
* Remove base/iter class, add train predict funcs
* Use type aliases for discard iterators
* update to include host_vector as thrust 1.12 doesn't bring it in as a side-effect
* cub::DispatchRadixSort requires signed offset types
- Reduce dependency on dmlc parsers and provide an interface for users to load data by themselves.
- Remove use of threaded iterator and IO queue.
- Remove `page_size`.
- Make sure the number of pages in memory is bounded.
- Make sure the cache can not be violated.
- Provide an interface for internal algorithms to process data asynchronously.
The role of ProxyDMatrix is going beyond what it was designed. Now it's used by both
QuantileDeviceDMatrix and inplace prediction. After the refactoring of sparse DMatrix it
will also be used for external memory. Renaming the C API to extract it from
QuantileDeviceDMatrix.
Other than modularizing the split evaluation function, this PR also removes some more functions including `InitNewNodes` and `BuildNodeStats` among some other unused variables. Also, scattered code like setting leaf weights is grouped into the split evaluator and `NodeEntry` is simplified and made private. Another subtle difference with the original implementation is that the modified code doesn't call `tree[nidx].Parent()` to traversal upward.
* Add feature score support for linear model.
* Port R interface to the new implementation.
* Add linear model support in Python.
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* Support categorical data for dask functional interface and DQM.
* Implement categorical data support for GPU GK-merge.
* Add support for dask functional interface.
* Add support for DQM.
* Get newer cupy.
* Categorical prediction with CPU predictor and GPU predict leaf.
* Implement categorical prediction for CPU prediction.
* Implement categorical prediction for GPU predict leaf.
* Refactor the prediction functions to have a unified get next node function.
Co-authored-by: Shvets Kirill <kirill.shvets@intel.com>
* Change C API name.
* Test for all primitive types from array.
* Add native support for CPU 128 float.
* Convert boolean and float16 in Python.
* Fix dask version for now.
The guard protects the global variable from being changed by XGBoost. But this leads to a
bug that the `n_threads` parameter is no longer used after the first iteration. This is
due to the fact that `omp_set_num_threads` is only called once in `Learner::Configure` at
the beginning of the training process.
The guard is still useful for `gpu_id`, since this is called all the times in our codebase
doesn't matter which iteration we are currently running.
currently installing the R-pacakge will leave the repo in dirty state, since
`CmakeLists.txt` is already checked in. This fixes the `cleanup`
script to not delete this file.
* Add `XGBOOST_RABIT_TRACKER_IP_FOR_TEST` to set rabit tracker IP
* change spark and rabit tracker IP to 127.0.0.1on GitHub Action.
Co-authored-by: fis <jm.yuan@outlook.com>
* Re-implement ROC-AUC.
* Binary
* MultiClass
* LTR
* Add documents.
This PR resolves a few issues:
- Define a value when the dataset is invalid, which can happen if there's an
empty dataset, or when the dataset contains only positive or negative values.
- Define ROC-AUC for multi-class classification.
- Define weighted average value for distributed setting.
- A correct implementation for learning to rank task. Previous
implementation is just binary classification with averaging across groups,
which doesn't measure ordered learning to rank.
The [general documentation](https://xgboost.readthedocs.io/en/latest/parameter.html#parameters-for-tree-booster) clearly has alpha and lambda under its "Parameters for Tree Booster" heading. Furthermore, the R package clearly uses alpha and lambda when told to use the tree booster. This update adds those two parameters to the documentation for the R package.
Closed issue #6763.
* [dask] Use `distributed.MultiLock`
This enables training multiple models in parallel.
* Conditionally import `MultiLock`.
* Use async train directly in scikit learn interface.
* Use `worker_client` when available.
* Ensure RMM is 0.18 or later
* Add use_rmm flag to global configuration
* Modify XGBCachingDeviceAllocatorImpl to skip CUB when use_rmm=True
* Update the demo
* [CI] Pin NumPy to 1.19.4, since NumPy 1.19.5 doesn't work with latest Shap
* Save feature info in booster in JSON model.
* [breaking] Remove automatic feature name generation in `DMatrix`.
This PR is to enable reliable feature validation in Python package.
* Add ability to load booster direct from byte array
* fix compiler error
* move InputStream to byte-buffer conversion
- move it from Booster to XGBoost facade class
* Use normal predictor for dart booster.
* Implement `inplace_predict` for dart.
* Enable `dart` for dask interface now that it's thread-safe.
* categorical data should be working out of box for dart now.
The implementation is not very efficient as it has to pull back the data and
apply weight for each tree, but still a significant improvement over previous
implementation as now we no longer binary search for each sample.
* Fix output prediction shape on dataframe.
* Stop printing out message.
* Remove R specialization.
The printed message is not really useful anyway, without a reproducible example
there's no way to fix it. But if there's a reproducible example, we can always
obtain these information by a debugger. Removing the `printf` function avoids
creating the context in kernel.
* Add a new API function for predicting on `DMatrix`. This function aligns
with rest of the `XGBoosterPredictFrom*` functions on semantic of function
arguments.
* Purge `ntree_limit` from libxgboost, use iteration instead.
* [dask] Use `inplace_predict` by default for dask sklearn models.
* [dask] Run prediction shape inference on worker instead of client.
The breaking change is in the Python sklearn `apply` function, I made it to be
consistent with other prediction functions where `best_iteration` is used by
default.
* Accept array interface for csr and array.
* Accept an optional proxy dmatrix for metainfo.
This constructs an explicit `_ProxyDMatrix` type in Python.
* Remove unused doc.
* Add strict output.
This PR changes predict and inplace_predict to accept a Future of model, to avoid sending models to workers repeatably.
* Document is updated to reflect functionality additions in recent changes.
* [dask] Use a 1 line sample to infer output shape.
This is for inferring shape with direct prediction (without DaskDMatrix).
There are a few things that requires known output shape before carrying out
actual prediction, including dask meta data, output dataframe columns.
* Infer output shape based on local prediction.
* Remove set param in predict function as it's not thread safe nor necessary as
we now let dask to decide the parallelism.
* Simplify prediction on `DaskDMatrix`.
This PR ensures all DMatrix types have a common interface.
* Fix logic in avoiding duplicated DMatrix in sklearn.
* Check for consistency between DMatrix types.
* Add doc for bounds.
* [java] extending the library loader to use both OS and CPU architecture.
* Simplifying create_jni.py's architecture detection.
* Tidying up the architecture detection in create_jni.py
The old (before fix) best_ntree_limit ignores the num_class parameters, which is incorrect. In before we workarounded it in c++ layer to avoid possible breaking changes on other language bindings. But the Python interpretation stayed incorrect. The PR fixed that in Python to consider num_class, but didn't remove the old workaround, so tree calculation in predictor is incorrect, see PredictBatch in CPUPredictor.
* Initial support for distributed LTR using dask.
* Support `qid` in libxgboost.
* Refactor `predict` and `n_features_in_`, `best_[score/iteration/ntree_limit]`
to avoid duplicated code.
* Define `DaskXGBRanker`.
The dask ranker doesn't support group structure, instead it uses query id and
convert to group ptr internally.
* Update dmlc-core submodule and conform to new API
* Remove unsupported parameter from method signature
* Update dmlc-core submodule and conform to new API
* Update dmlc-core
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
* For sklearn:
- Handles user defined objective function.
- Handles `softmax`.
* For dask:
- Use the implementation from sklearn, the previous implementation doesn't perform any extra handling.
* Calling XGBModel.fit() should clear the Booster by default
* Document the behavior of fit()
* Allow sklearn object to be passed in directly via xgb_model argument
* Fix lint
For the `gamma-nloglik` eval metric, small positive values in the labels are causing `NaN`'s in the outputs, as reported here: https://github.com/dmlc/xgboost/issues/5349. This will add clipping on them, similar to what is done in other metrics like `poisson-nloglik` and `logloss`.
* Implement early stopping with training continuation.
* Add new C API for obtaining boosted rounds.
* Fix off by 1 in `save_best`.
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* Enable loading model from <1.0.0 trained with objective='binary:logitraw'
* Add binary:logitraw in model compatibility testing suite
* Feedback from @trivialfis: Override ProbToMargin() for LogisticRaw
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
* [CI] Upgrade cuDF and RMM to 0.18 nightlies
* Modify RMM plugin to be compatible with RMM 0.18
* Update src/common/device_helpers.cuh
Co-authored-by: Mark Harris <mharris@nvidia.com>
Co-authored-by: Mark Harris <mharris@nvidia.com>
* Vendor libgomp in the manylinux2014_aarch64 wheel
* Use vault repo, since CentOS 6 has reached End-of-Life on Nov 30
* Vendor libgomp in the manylinux2010_x86_64 wheel
* Run verification step inside the container
* Add management functions for global configuration: XGBSetGlobalConfig(), XGBGetGlobalConfig().
* Add Python interface: set_config(), get_config(), and config_context().
* Add unit tests for Python
* Add R interface: xgb.set.config(), xgb.get.config()
* Add unit tests for R
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
* Do not derive from unittest.TestCase (not needed for pytest)
* assertRaises -> pytest.raises
* Simplify test_empty_dmatrix with test parametrization
* setUpClass -> setup_class, tearDownClass -> teardown_class
* Don't import unittest; import pytest
* Use plain assert
* Use parametrized tests in more places
* Fix test_gpu_with_sklearn.py
* Put back run_empty_dmatrix_reg / run_empty_dmatrix_cls
* Fix test_eta_decay_gpu_hist
* Add parametrized tests for monotone constraints
* Fix test names
* Remove test parametrization
* Revise test_slice to be not flaky
Deprecate positional arguments in following functions:
- `__init__` for all classes in sklearn module.
- `fit` method for all classes in sklearn module.
- dask interface.
- `set_info` for `DMatrix` class.
Refactor the evaluation matrices handling.
* [CI] Add noLD test
* Make noLD test only trigger with a PR comment
* [CI] Don't install stringi
* Add the Titanic example as a unit test
* Document trigger
* add to index
* Clarify that it needs to be a review comment
* Remove R check from Jenkins
* Print stacktrace when CRAN test fail in GitHub Actions
* Add verbose flag in tests/ci_build/print_r_stacktrace.sh
* Fix path in tests/ci_build/print_r_stacktrace.sh
* Make external memory data partitioning deterministic.
* Change the meaning of `page_size` from bytes to number of rows.
* Design a data pool.
* Note for external memory.
* Enable unity build on Windows CI.
* Force garbage collect on test.
This PR is meant the end the confusion around best_ntree_limit and unify model slicing. We have multi-class and random forests, asking users to understand how to set ntree_limit is difficult and error prone.
* Implement the save_best option in early stopping.
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* Removed some warnings
* Rebase with master
* Solved C++ Google Tests errors made by refactoring in order to remove warnings
* Undo renaming path -> path_
* Fix style check
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
CLI is not most developed interface. Putting them into correct directory can help new users to avoid it as most of the use cases are from a language binding.
* Fix warnings for json.h
* Fix warnings for metric.h
* Fix warnings for updater_quantile_hist.cc.
* Fix warnings for updater_histmaker.cc.
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
* Deprecate LabelEncoder in XGBClassifier; skip LabelEncoder for cuDF/cuPy inputs
* Add unit tests for cuDF and cuPy inputs with XGBClassifier
* Fix lint
* Clarify warning
* Move use_label_encoder option to XGBClassifier constructor
* Add a test for cudf.Series
* Add use_label_encoder to XGBRFClassifier doc
* Address reviewer feedback
* Now it's built as part of libxgboost.
* Set correct C API error in RABIT initialization and finalization.
* Remove redundant message.
* Guard the tracker print C API.
* Disable JSON serialization for now.
* Multi-class classification is checkpointing for each iteration.
This brings significant overhead.
Revert: 90355b4f00
* Set R tests to use binary.
* [CI] Clean up build for JVM packages
* Use correct path for saving native lib
* Fix groupId of maven-surefire-plugin
* Fix stashing of xgboost4j_jar_gpu
* [CI] Don't run xgboost4j-tester with GPU, since it doesn't use gpu_hist
* Change DefaultEvalMetric of classification from error to logloss
* Change default binary metric in plugin/example/custom_obj.cc
* Set old error metric in python tests
* Set old error metric in R tests
* Fix missed eval metrics and typos in R tests
* Fix setting eval_metric twice in R tests
* Add warning for empty eval_metric for classification
* Fix Dask tests
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
* Supply `-G;-src-in-ptx` when `USE_DEVICE_DEBUG` is set and debug mode is selected.
* Refactor CMake script to gather all CUDA configuration.
* Use CMAKE_CUDA_ARCHITECTURES. Close#6029.
* Add compute 80. Close#5999
* Fall back to CUB allocator if RMM memory pool is not set up
* Fix build
* Prevent memory leak
* Add note about lack of memory initialisation
* Add check for other fast allocators
* Set use_cub_allocator_ to true when RMM is not enabled
* Fix clang-tidy
* Do not demangle symbol; add check to ensure Linux+Clang/GCC combo
* [R] Fix empty empty tests and a test warnings
* [R] Remove stringi dependency (fix#5905)
* Fix R lint check
* [R] Fix automatic conversion to factor in R < 4.0.0 in xgb.model.dt.tree
* Add `R` Makefile variable
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* Workaround a compiler bug in MacOS AppleClang
* [CI] Run C++ test with MacOS Catalina + AppleClang 11.0.3
* [CI] Migrate cmake_test on MacOS from Travis CI to GitHub Actions
* Install OpenMP runtime
* [CI] Use CMake to locate lz4 lib
* Add getNumFeature to the Java API
* Add getNumFeature to the Scala API
* Add unit tests for getNumFeature
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* Fix CMake build with BUILD_STATIC_LIB option
* Disable BUILD_STATIC_LIB option when R/JVM pkg is enabled
* Add objxgboost to install target only when BUILD_STATIC_LIB=ON
* cancel job instead of killing SparkContext
This PR changes the default behavior that kills SparkContext. Instead, This PR
cancels jobs when coming across task failed. That means the SparkContext is
still alive even some exceptions happen.
* add a parameter to control if killing SparkContext
* cancel the jobs the failed task belongs to
* remove the jobId from the map when one job failed.
* resolve comments
We propose to only use the rowHashCode to compute the partitionKey, adding the FeatureValue hashCode does not bring more value and would make the computation slower. Even though a collision would appear at 0.2% with MurmurHash3 this is bearable for partitioning, this won't have any impact on the data balancing.
* Modin DF support
* mode change
* tests were added, ci env was extended
* mode change
* Remove redundant installation of modin
* Add a pytest skip marker for modin
* Install Modin[ray] from PyPI
* fix interfering
* avoid extra conversion
* delete cv test for modin
* revert cv function
Co-authored-by: ShvetsKS <kirill.shvets@intel.com>
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
* [CI] Improve JVM test in GitHub Actions
* Use env var for Wagon options [skip ci]
* Move the retry flag to pom.xml [skip ci]
* Export env var RABIT_MOCK to run Spark tests [skip ci]
* Correct location of env var
* Re-try up to 5 times [skip ci]
* Don't run distributed training test on Windows
* Fix typo
* Update main.yml
* Fix a unit test on CLI, to handle RC versions
* [CI] Use mgpu machine to run gpu hist unit tests
* [CI] Build GPU-enabled JAR artifact and deploy to xgboost-maven-repo
* [CI] Move lint to GitHub Actions
* [CI] Move Doxygen to GitHub Actions
* [CI] Move Sphinx build test to GitHub Actions
* [CI] Reduce workload for Windows R tests
* [CI] Move clang-tidy to Build stage
The functions featureValueOfSparseVector or featureValueOfDenseVector could return a Float.NaN if the input vectore was containing any missing values. This would make fail the partition key computation and most of the vectors would end up in the same partition. We fix this by avoid returning a NaN and simply use the row HashCode in this case.
We added a test to ensure that the repartition is indeed now uniform on input dataset containing values by checking that the partitions size variance is below a certain threshold.
Signed-off-by: Anthony D'Amato <anthony.damato@hotmail.fr>
* add SHAP summary plot using ggplot2
* Update xgb.plot.shap
* Update example in xgb.plot.shap documentation
* update logic, add tests
* whitespace fixes
* whitespace fixes for test_helpers
* namespace for sd function
* explicitly declare variables that are automatically evaluated by data.table
* Fix R lint
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* fixed some endian issues
* Use dmlc::ByteSwap() to simplify code
* Fix lint check
* [CI] Add test for s390x
* Download latest CMake on s390x
* Fix a bug in my code
* Save magic number in dmatrix with byteswap on big-endian machine
* Save version in binary with byteswap on big-endian machine
* Load scalar with byteswap in MetaInfo
* Add a debugging message
* Handle arrays correctly when byteswapping
* EOF can also be 255
* Handle magic number in MetaInfo carefully
* Skip Tree.Load test for big-endian, since the test manually builds little-endian binary model
* Handle missing packages in Python tests
* Don't use boto3 in model compatibility tests
* Add s390 Docker file for local testing
* Add model compatibility tests
* Add R compatibility test
* Revert "Add R compatibility test"
This reverts commit c2d2bdcb7dbae133cbb927fcd20f7e83ee2b18a8.
Co-authored-by: Qi Zhang <q.zhang@ibm.com>
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
* [CI] Add RMM as an optional dependency
* Replace caching allocator with pool allocator from RMM
* Revert "Replace caching allocator with pool allocator from RMM"
This reverts commit e15845d4e72e890c2babe31a988b26503a7d9038.
* Use rmm::mr::get_default_resource()
* Try setting default resource (doesn't work yet)
* Allocate pool_mr in the heap
* Prevent leaking pool_mr handle
* Separate EXPECT_DEATH() in separate test suite suffixed DeathTest
* Turn off death tests for RMM
* Address reviewer's feedback
* Prevent leaking of cuda_mr
* Fix Jenkinsfile syntax
* Remove unnecessary function in Jenkinsfile
* [CI] Install NCCL into RMM container
* Run Python tests
* Try building with RMM, CUDA 10.0
* Do not use RMM for CUDA 10.0 target
* Actually test for test_rmm flag
* Fix TestPythonGPU
* Use CNMeM allocator, since pool allocator doesn't yet support multiGPU
* Use 10.0 container to build RMM-enabled XGBoost
* Revert "Use 10.0 container to build RMM-enabled XGBoost"
This reverts commit 789021fa31112e25b683aef39fff375403060141.
* Fix Jenkinsfile
* [CI] Assign larger /dev/shm to NCCL
* Use 10.2 artifact to run multi-GPU Python tests
* Add CUDA 10.0 -> 11.0 cross-version test; remove CUDA 10.0 target
* Rename Conda env rmm_test -> gpu_test
* Use env var to opt into CNMeM pool for C++ tests
* Use identical CUDA version for RMM builds and tests
* Use Pytest fixtures to enable RMM pool in Python tests
* Move RMM to plugin/CMakeLists.txt; use PLUGIN_RMM
* Use per-device MR; use command arg in gtest
* Set CMake prefix path to use Conda env
* Use 0.15 nightly version of RMM
* Remove unnecessary header
* Fix a unit test when cudf is missing
* Add RMM demos
* Remove print()
* Use HostDeviceVector in GPU predictor
* Simplify pytest setup; use LocalCUDACluster fixture
* Address reviewers' commments
Co-authored-by: Hyunsu Cho <chohyu01@cs.wasshington.edu>
* Added plugin with DPC++-based predictor and objective function
* Update CMakeLists.txt
* Update regression_obj_oneapi.cc
* Added README.md for OneAPI plugin
* Added OneAPI predictor support to gbtree
* Update README.md
* Merged kernels in gradient computation. Enabled multiple loss functions with DPC++ backend
* Aligned plugin CMake files with latest master changes. Fixed whitespace typos
* Removed debug output
* [CI] Make oneapi_plugin a CMake target
* Added tests for OneAPI plugin for predictor and obj. functions
* Temporarily switched to default selector for device dispacthing in OneAPI plugin to enable execution in environments without gpus
* Updated readme file.
* Fixed USM usage in predictor
* Removed workaround with explicit templated names for DPC++ kernels
* Fixed warnings in plugin tests
* Fix CMake build of gtest
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
* Allow non-zero for missing value when training.
* Fix wrong method names.
* Add a unit test
* Move the getter/setter unit test to MissingValueHandlingSuite
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
* [CI] Assign larger /dev/shm to NCCL
* Use 10.2 artifact to run multi-GPU Python tests
* Add CUDA 10.0 -> 11.0 cross-version test; remove CUDA 10.0 target
* [CI] Move lint to a separate script
* [CI] Improved lintr launcher
* Add lintr as a separate action
* Add custom parsing logic to print out logs
* Fix lintr issues in demos
* Run R demos
* Fix CRAN checks
* Install XGBoost into R env before running lintr
* Install devtools (needed to run demos)
* [jvm-packages] add gpu_hist tree method
* change updater hist to grow_quantile_histmaker
* add gpu scheduling
* pass correct parameters to xgboost library
* remove debug info
* add use.cuda for pom
* add CI for gpu_hist for jvm
* add gpu unit tests
* use gpu node to build jvm
* use nvidia-docker
* Add CLI interface to create_jni.py using argparse
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
* [R] Add a compatibility layer to load Booster from an old RDS
* Modify QuantileHistMaker::LoadConfig() to be backward compatible with 1.1.x
* Add a big warning about compatibility in QuantileHistMaker::LoadConfig()
* Add testing suite
* Discourage use of saveRDS() in CRAN doc
* set a minimal reducer msg size. Receive the same data size from parent each time.
* When parent read from a child, check it receive minimal reduce size.
fix bug. Rewrite the minimal reducer size check, make sure it's 1~N times of minimal reduce size
Assume the minimal reduce size is X, the logic here is
1: each child upload total_size of message
2: each parent receive X message at least, up to total_size
3: parent reduce X or NxX or total_size message
4: parent sends X or NxX or total_size message to its parent
4: parent's parent receive X message at least, up to total_size. Then reduce X or NxX or total_size message
6: parent's parent sends X or NxX or total_size message to its children
7: parent receives X or NxX or total_size message, sends to its children
8: child receive X or NxN or total_size message.
During the whole process, each transfer is (1~N)xX Byte message or up to total_size.
if X is larger than total_size, then allreduce allways reduce the whole messages and pass down.
* Follow style check rule
* fix the cpplint check
* fix allreduce_base header seq
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* Publish artifacts only on the master and release branches
* Build CUDA only for Compute Capability 7.5 when building PRs
* Run all Windows jobs in a single worker image
* Build nightly XGBoost4J SNAPSHOT JARs with Scala 2.12 only
* Show skipped Python tests on Windows
* Make Graphviz optional for Python tests
* Add back C++ tests
* Unstash xgboost_cpp_tests
* Fix label to CUDA 10.1
* Install cuPy for CUDA 10.1
* Install jsonschema
* Address reviewer's feedback
* Add interval accuracy
* De-virtualize AFT functions
* Lint
* Refactor AFT metric using GPU-CPU reducer
* Fix R build
* Fix build on Windows
* Fix copyright header
* Clang-tidy
* Fix crashing demo
* Fix typos in comment; explain GPU ID
* Remove unnecessary #include
* Add C++ test for interval accuracy
* Fix a bug in accuracy metric: use log pred
* Refactor AFT objective using GPU-CPU Transform
* Lint
* Fix lint
* Use Ninja to speed up build
* Use time, not /usr/bin/time
* Add cpu_build worker class, with concurrency = 1
* Use concurrency = 1 only for CUDA build
* concurrency = 1 for clang-tidy
* Address reviewer's feedback
* Update link to AFT paper
* Implement GK sketching on GPU.
* Strong tests on quantile building.
* Handle sparse dataset by binary searching the column index.
* Hypothesis test on dask.
* Add thread local return entry for DMatrix.
* Save feature name and feature type in binary file.
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* Increased error in coordinate is mostly due to floating point error.
* Shotgun uses Hogwild!, which is non-deterministic and can have even greater
floating point error.
* Add an option to run brute-force test for JSON round-trip
* Apply reviewer's feedback
* Remove unneeded objects
* Parallel run.
* Max.
* Use signed 64-bit loop var, to support MSVC
* Add exhaustive test to CI
* Run JSON test in Win build worker
* Revert "Run JSON test in Win build worker"
This reverts commit c97b2c7dda37b3585b445d36961605b79552ca89.
* Revert "Add exhaustive test to CI"
This reverts commit c149c2ce9971a07a7289f9b9bc247818afd5a667.
Co-authored-by: fis <jm.yuan@outlook.com>
* Use hypothesis
* Allow int64 array interface for groups
* Add packages to Windows CI
* Add to travis
* Make sure device index is set correctly
* Fix dask-cudf test
* appveyor
* [R-package] replace uses of T and F with TRUE and FALSE
* enable linting
* Remove skip
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* [CI] Use Vault repository to re-gain access to devtoolset-4
* Use manylinux2010 tag
* Update Dockerfile.jvm
* Fix rename_whl.py
* Upgrade Pip, to handle manylinux2010 tag
* Update insert_vcomp140.py
* Update test_python.sh
* Set output margin to True for custom objective in Python and R.
* Add a demo for writing multi-class custom objective function.
* Run tests on selected demos.
* Group aware GPU weighted sketching.
* Distribute group weights to each data point.
* Relax the test.
* Validate input meta info.
* Fix metainfo copy ctor.
* Add inplace prediction for dask-cudf.
* Remove Dockerfile.release, since it's not used anywhere
* Use Conda exclusively in CUDF and GPU containers
* Improve cupy memory copying.
* Add skip marks to tests.
* Add mgpu-cudf category on the CI to run all distributed tests.
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
* Ensure that configured header (build_config.h) from dmlc-core is picked up by Rabit and XGBoost
* Check which Rabit target is being used
* Use CMake 3.13 in all Jenkins tests
* Upgrade CMake in Travis CI
* Install CMake using Kitware installer
* Remove existing CMake (3.12.4)
* Use devtoolset-6.
* [CI] Use devtoolset-6 because devtoolset-4 is EOL and no longer available
* CUDA 9.0 doesn't work with devtoolset-6; use devtoolset-4 for GPU build only
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
* Add bindings for serialization.
* Change `xgb.save.raw' into full serialization instead of simple model.
* Add `xgb.load.raw' for unserialization.
* Run devtools.
* fix type error
* Validate number of features.
* resolve comments
* add feature size for LabelPoint and DataBatch
* pass the feature size to native
* move feature size validating tests into a separate suite
* resolve comments
Co-authored-by: fis <jm.yuan@outlook.com>
* Robust regularization of AFT gradient and hessian
* Fix AFT doc; expose it to tutorial TOC
* Apply robust regularization to uncensored case too
* Revise unit test slightly
* Fix lint
* Update test_survival.py
* Use GradientPairPrecise
* Remove unused variables
* Set default dtor for SimpleDMatrix to initialize default copy ctor, which is
deleted due to unique ptr.
* Remove commented code.
* Remove warning for calling host function (std::max).
* Remove warning for initialization order.
* Remove warning for unused variables.
Normal prediction with DMatrix is now thread safe with locks. Added inplace prediction is lock free thread safe.
When data is on device (cupy, cudf), the returned data is also on device.
* Implementation for numpy, csr, cudf and cupy.
* Implementation for dask.
* Remove sync in simple dmatrix.
* [WIP] Add lower and upper bounds on the label for survival analysis
* Update test MetaInfo.SaveLoadBinary to account for extra two fields
* Don't clear qids_ for version 2 of MetaInfo
* Add SetInfo() and GetInfo() method for lower and upper bounds
* changes to aft
* Add parameter class for AFT; use enum's to represent distribution and event type
* Add AFT metric
* changes to neg grad to grad
* changes to binomial loss
* changes to overflow
* changes to eps
* changes to code refactoring
* changes to code refactoring
* changes to code refactoring
* Re-factor survival analysis
* Remove aft namespace
* Move function bodies out of AFTNormal and AFTLogistic, to reduce clutter
* Move function bodies out of AFTLoss, to reduce clutter
* Use smart pointer to store AFTDistribution and AFTLoss
* Rename AFTNoiseDistribution enum to AFTDistributionType for clarity
The enum class was not a distribution itself but a distribution type
* Add AFTDistribution::Create() method for convenience
* changes to extreme distribution
* changes to extreme distribution
* changes to extreme
* changes to extreme distribution
* changes to left censored
* deleted cout
* changes to x,mu and sd and code refactoring
* changes to print
* changes to hessian formula in censored and uncensored
* changes to variable names and pow
* changes to Logistic Pdf
* changes to parameter
* Expose lower and upper bound labels to R package
* Use example weights; normalize log likelihood metric
* changes to CHECK
* changes to logistic hessian to standard formula
* changes to logistic formula
* Comply with coding style guideline
* Revert back Rabit submodule
* Revert dmlc-core submodule
* Comply with coding style guideline (clang-tidy)
* Fix an error in AFTLoss::Gradient()
* Add missing files to amalgamation
* Address @RAMitchell's comment: minimize future change in MetaInfo interface
* Fix lint
* Fix compilation error on 32-bit target, when size_t == bst_uint
* Allocate sufficient memory to hold extra label info
* Use OpenMP to speed up
* Fix compilation on Windows
* Address reviewer's feedback
* Add unit tests for probability distributions
* Make Metric subclass of Configurable
* Address reviewer's feedback: Configure() AFT metric
* Add a dummy test for AFT metric configuration
* Complete AFT configuration test; remove debugging print
* Rename AFT parameters
* Clarify test comment
* Add a dummy test for AFT loss for uncensored case
* Fix a bug in AFT loss for uncensored labels
* Complete unit test for AFT loss metric
* Simplify unit tests for AFT metric
* Add unit test to verify aggregate output from AFT metric
* Use EXPECT_* instead of ASSERT_*, so that we run all unit tests
* Use aft_loss_param when serializing AFTObj
This is to be consistent with AFT metric
* Add unit tests for AFT Objective
* Fix OpenMP bug; clarify semantics for shared variables used in OpenMP loops
* Add comments
* Remove AFT prefix from probability distribution; put probability distribution in separate source file
* Add comments
* Define kPI and kEulerMascheroni in probability_distribution.h
* Add probability_distribution.cc to amalgamation
* Remove unnecessary diff
* Address reviewer's feedback: define variables where they're used
* Eliminate all INFs and NANs from AFT loss and gradient
* Add demo
* Add tutorial
* Fix lint
* Use 'survival:aft' to be consistent with 'survival:cox'
* Move sample data to demo/data
* Add visual demo with 1D toy data
* Add Python tests
Co-authored-by: Philip Cho <chohyu01@cs.washington.edu>
* Move thread local entry into Learner.
This is an attempt to workaround CUDA context issue in static variable, where
the CUDA context can be released before device vector.
* Add PredictionEntry to thread local entry.
This eliminates one copy of prediction vector.
* Don't define CUDA C API in a namespace.
* - create a gpu metrics (internal) registry
- the objective is to separate the cpu and gpu implementations such that they evolve
indepedently. to that end, this approach will:
- preserve the same metrics configuration (from the end user perspective)
- internally delegate the responsibility to the gpu metrics builder when there is a
valid device present
- decouple the gpu metrics builder from the cpu ones to prevent misuse
- move away from including the cuda file from within the cc file and segregate the code
via ifdef's
* Use pre-rounding based method to obtain reproducible floating point
summation.
* GPU Hist for regression and classification are bit-by-bit reproducible.
* Add doc.
* Switch to thrust reduce for `node_sum_gradient`.
* Add release note for 1.0.0
* Fix a small bug in the Python script that compiles the list of contributors
* Clarify governance of CI infrastructure; now PMC is formally in charge
* Address reviewer comment
* Fix typo
- move segment sorter to common
- this is the first of a handful of pr's that splits the larger pr #5326
- it moves this facility to common (from ranking objective class), so that it can be
used for metric computation
- it also wraps all the bald device pointers into span.
* Remove f-string, since it's not supported by Python 3.5 (#5330)
* Remove f-string, since it's not supported by Python 3.5
* Add Python 3.5 to CI, to ensure compatibility
* Remove duplicated matplotlib
* Show deprecation notice for Python 3.5
* Fix lint
* Fix lint
* Fix a unit test that mistook MINOR ver for PATCH ver
* Enforce only major version in JSON model schema
* Bump version to 1.1.0-SNAPSHOT
* Added a check call macro in jvm package, prevents executing other functions
from jvm when error occurred in XGBoost. For example, when prediction fails jvm
should not try to allocate memory based on the output prediction size.
Move this function into gbtree, and uses only updater for doing so. As now the predictor knows exactly how many trees to predict, there's no need for it to update the prediction cache.
* Move prediction cache into Learner.
* Clean-ups
- Remove duplicated cache in Learner and GBM.
- Remove ad-hoc fix of invalid cache.
- Remove `PredictFromCache` in predictors.
- Remove prediction cache for linear altogether, as it's only moving the
prediction into training process but doesn't provide any actual overall speed
gain.
- The cache is now unique to Learner, which means the ownership is no longer
shared by any other components.
* Changes
- Add version to prediction cache.
- Use weak ptr to check expired DMatrix.
- Pass shared pointer instead of raw pointer.
The setup.py is rewritten. This new script uses only Python code and provide customized
implementation of setuptools commands. This way users can run most of setuptools commands
just like any other Python libraries.
* Remove setup_pip.py
* Remove soft links.
* Define customized commands.
* Remove shell script.
* Remove makefile script.
* Update the doc for building from source.
* Make pip install xgboost*.tar.gz work by fixing build-python.sh
* Simplify install doc
* Add test
* Install Miniconda for Linux target too
* Build XGBoost only once in sdist
* Try importing xgboost after installation
* Don't set PYTHONPATH env var for sdist test
* Turn xgboost::DataType into C++11 enum class
* New binary serialization format for DMatrix::MetaInfo
* Fix clang-tidy
* Fix c++ test
* Implement new format proposal
* Move helper functions to anonymous namespace; remove unneeded field
* Fix lint
* Add shape.
* Keep only roundtrip test.
* Fix test.
* various fixes
* Update data.cc
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
* Simplify Scikit-Learn parameter management.
* Copy base class for removing duplicated parameter signatures.
* Set all parameters to None.
* Handle None in set_param.
* Extract the doc.
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
* Simplify DropTrees calling logic
* Add `training` parameter for prediction method.
* [Breaking]: Add `training` to C API.
* Change for R and Python custom objective.
* Correct comment.
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
* Fix syncing DMatrix columns.
* notes for tree method.
* Enable feature validation for all interfaces except for jvm.
* Better tests for boosting from predictions.
* Disable validation on JVM.
* Disable parameter validation for now.
Scikit-Learn passes all parameters down to XGBoost, whether they are used or
not.
* Add option `validate_parameters`.
* - implementation of map ranking algorithm
- also effected necessary suggestions mentioned in the earlier ranking pr's
- made some performance improvements to the ndcg algo as well
* Add OpenMP as CMake target
* Require CMake 3.12, to allow linking OpenMP target to objxgboost
* Specify OpenMP compiler flag for CUDA host compiler
* Require CMake 3.16+ if the OS is Mac OSX
* Use AppleClang in Mac tests.
* Update dmlc-core
* Remove `learning_rates`.
It's been deprecated since we have callback.
* Set `before_iteration` of `reset_learning_rate` to False to preserve
the initial learning rate, and comply to the term "reset".
Closes#4709.
* Tests for various `tree_method`.
* Pass pointer to model parameters.
This PR de-duplicates most of the model parameters except the one in
`tree_model.h`. One difficulty is `base_score` is a model property but can be
changed at runtime by objective function. Hence when performing model IO, we
need to save the one provided by users, instead of the one transformed by
objective. Here we created an immutable version of `LearnerModelParam` that
represents the value of model parameter after configuration.
This PR fixes tree weights in dart being ignored when computing contributions.
* Fix ellpack page source link.
* Add tree weights to compute contribution.
- Install wget explicitly to match openssl.
- Install CMake explicitly.
- Use newer miniconda link.
- Reenable unittests.
- gcc@9 + xcode@10 for osx due to missing <_stdio.h>. Other versions of gcc should also work. But as homebrew pour gcc@9 after update by default, so I just stick with latest version.
- Disabled one external memory test for OSX. Not sure about the thread implementation in there and fixing external memory is beyond the scope of this PR.
- Use Python3 with conda in jvm package.
* Extract interaction constraints from split evaluator.
The reason for doing so is mostly for model IO, where num_feature and interaction_constraints are copied in split evaluator. Also interaction constraint by itself is a feature selector, acting like column sampler and it's inefficient to bury it deep in the evaluator chain. Lastly removing one another copied parameter is a win.
* Enable inc for approx tree method.
As now the implementation is spited up from evaluator class, it's also enabled for approx method.
* Removing obsoleted code in colmaker.
They are never documented nor actually used in real world. Also there isn't a single test for those code blocks.
* Unifying the types used for row and column.
As the size of input dataset is marching to billion, incorrect use of int is subject to overflow, also singed integer overflow is undefined behaviour. This PR starts the procedure for unifying used index type to unsigned integers. There's optimization that can utilize this undefined behaviour, but after some testings I don't see the optimization is beneficial to XGBoost.
This makes GPU Hist robust in distributed environment as some workers might not
be associated with any data in either training or evaluation.
* Disable rabit mock test for now: See #5012 .
* Disable dask-cudf test at prediction for now: See #5003
* Launch dask job for all workers despite they might not have any data.
* Check 0 rows in elementwise evaluation metrics.
Using AUC and AUC-PR still throws an error. See #4663 for a robust fix.
* Add tests for edge cases.
* Add `LaunchKernel` wrapper handling zero sized grid.
* Move some parts of allreducer into a cu file.
* Don't validate feature names when the booster is empty.
* Sync number of columns in DMatrix.
As num_feature is required to be the same across all workers in data split
mode.
* Filtering in dask interface now by default syncs all booster that's not
empty, instead of using rank 0.
* Fix Jenkins' GPU tests.
* Install dask-cuda from source in Jenkins' test.
Now all tests are actually running.
* Restore GPU Hist tree synchronization test.
* Check UUID of running devices.
The check is only performed on CUDA version >= 10.x, as 9.x doesn't have UUID field.
* Fix CMake policy and project variables.
Use xgboost_SOURCE_DIR uniformly, add policy for CMake >= 3.13.
* Fix copying data to CPU
* Fix race condition in cpu predictor.
* Fix duplicated DMatrix construction.
* Don't download extra nccl in CI script.
* Do not store built artifacts in the Jenkins master
* Add wheel renaming script
* Upload wheels to S3 bucket
* Use env.GIT_COMMIT
* Capture git hash correctly
* Add missing import in Jenkinsfile
* Address reviewer's comments
* Put artifacts for pull requests in separate directory
* No wildcard expansion in Windows CMD
* Use `UpdateAllowUnknown' for non-model related parameter.
Model parameter can not pack an additional boolean value due to binary IO
format. This commit deals only with non-model related parameter configuration.
* Add tidy command line arg for use-dmlc-gtest.
* - pairwise ranking objective implementation on gpu
- there are couple of more algorithms (ndcg and map) for which support will be added
as follow-up pr's
- with no label groups defined, get gradient is 90x faster on gpu (120m instance
mortgage dataset)
- it can perform by an order of magnitude faster with ~ 10 groups (and adequate cores
for the cpu implementation)
* Add JSON config to rank obj.
* Use CMake config file for representing version.
* Generate c and Python version file with CMake.
The generated file is written into source tree. But unless XGBoost upgrades
its version, there will be no actual modification. This retains compatibility
with Makefiles for R.
* Add XGBoost version the DMatrix binaries.
* Simplify prefetch detection in CMakeLists.txt
* Apply Configurable to objective functions.
* Apply Model to Learner and Regtree, gbm.
* Add Load/SaveConfig to objs.
* Refactor obj tests to use smart pointer.
* Dummy methods for Save/Load Model.
* Don't set_params at the end of set_state.
* Also fix another issue found in dask prediction.
* Add note about prediction.
Don't support other prediction modes at the moment.
* Move get transpose into cc.
* Clean up headers in host device vector, remove thrust dependency.
* Move span and host device vector into public.
* Install c++ headers.
* Short notes for c and c++.
Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* Add BigDenseMatrix
* ability to create DMatrix with bigger than Integer.MAX_VALUE size arrays
* uses sun.misc.Unsafe
* make DMatrix test work from a jar as well
* apply openmp simd
* clean __buildin detection, moving windows build check from xgboost project, add openmp support for vectorize reduce
* apply openmp only to rabit
* orgnize rabit signature
* remove is_bootstrap, use load_checkpoint as implict flag
* visual studio don't support latest openmp
* orgnize omp declarations
* replace memory copy with vector cast
* Revert "replace memory copy with vector cast"
This reverts commit 28de4792dcdff40d83d458510d23b7ef0b191d79.
* Revert "orgnize omp declarations"
This reverts commit 31341233d31ce93ccf34d700262b1f3f6690bbfe.
* remove openmp settings, merge into a upcoming pr
* mis
* per feedback, update comments
* Add public group getter for java and scala
* Remove unnecessary param from javadoc
* Fix typo
* Fix another typo
* Add semicolon
* Fix javadoc return statement
* Fix missing return statement
* Add a unit test
* Restrict access to `cfg_` in gbm.
* Verify having correct updaters.
* Remove `grow_global_histmaker`
This updater is the same as `grow_histmaker`. The former is not in our
document so we just remove it.
* support run rabit tests as xgboost subproject using xgboost/dmlc-core
* support tracker config set/get
* remove redudant printf
* remove redudant printf
* add c++0x declaration
* log allreduce/broadcast caller, engine should track caller stack for
investigation
* tracker support binary config format
* Revert "tracker support binary config format"
This reverts commit 2a28e5e2b55c200cb621af8d19f17ab1bc62503b.
* remove caller, prototype fetch allreduce/broadcast results from resbuf
* store cached allreduce/broadcast seq_no to tracker
* allow restore all caches from other nodes
* try new rabit collective cache, todo: recv_link seems down
* link up cache restore with main recovery
* cleanup load cache state
* update cache api
* pass test.mk
* have a working tests
* try to unify check into actionsummary
* more logging to debug distributed hist three method issue
* update rabit interface to support caller signature matching
* splite seq_counter from cur_cache_seq to different variables
* still see issue with inf loop
* support debug print caller as well as allreduce op
* cleanup
* remove get/set cache from model_recover, adding recover in
loadcheckpoint
* clarify rabit cache strategy, cache is set only by successful collective
call involving all nodes with unique cache key. if all nodes call
getcache at same time, we keep rabit run collective call. If some nodes
call getcache while others not, we backfill cache from those nodes with
most entries
* revert caller logs
* fix lint error
* fix engine mpi signature
* support getcache by ref
* allow result buffer presiet to filestream
* add loging
* try fix checkpoint failure recovery case
* use int64_t to avoid overflow caused seq fault
* try avoid int overflow
* try fix checkpoint failure recovery case
* try avoid seqno overflow to negative by offseting specifial flag value
adding cache seq no to checkpoint/load checkpoint/check point ack to avoid
confusion from cache recovery
* fix cache seq assert error
* remove loging, handle edge case
* add extensive log to checkpoint state with different seq no
* fix lint errors
* clean up comments before merge back to master
* add logs to allreduce/broadcast/checkpoint
* use unsinged int 32 and give seq no larger range
* address remove allreduce dropseq code segment
* using caller signature to filter bootstrapallreduces
* remove get/set cache from empty
* apply signature to reducer
* apply signature to broadcast
* add key to broadcat log
* fix broadcast signature
* fix default _line value for non linux system
* adding comments, remove sleep(1)
* fix osx build issue
* try fix mpi
* fix doc
* fix engine_empty api
* logging, adding more logs, restore immutable assertion
* print unsinged int with ud
* fix lint
* rename seqtype to kSeq and KCache indicating it's usage
apply kDiffSeq check to load_cache routine
* comment allreduce/broadcast log
* allow tests run on arm
* enable flag to turn on / off cache
* add log info alert if user choose to enable rabit bootstrap cache
* add rabit_debug setting so user can use config to turn on
* log flags when user turn on rabit_debug
* force rabit restart if tracker assign -1 rank
* use OPENMP to vecotrize reducer
* address comment
* Revert "address comment"
This reverts commit 1dc61f33e7357dad8fa65528abeb81db92c5f9ed.
* fix checkpoint size print 0
* per feedback, remove DISABLEOPEMP, address race condition
* - remove openmp from this pr
- update name from cache to boostrapcache
* add default value of signature macros
* remove openmp from cmake file
* Update src/allreduce_robust.cc
Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* Update src/allreduce_robust.cc
Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* run test with cmake
* remove openmp
* fix cmake based tests
* use cmake test fix darwin .dylib issue
* move around rabit_signature definition due to windows build
* misc, add c++ check in CMakeFile
* per feedback
* resolve CMake file
* update rabit version
* Initial support for cudf integration.
* Add two C APIs for consuming data and metainfo.
* Add CopyFrom for SimpleCSRSource as a generic function to consume the data.
* Add FromDeviceColumnar for consuming device data.
* Add new MetaInfo::SetInfo for consuming label, weight etc.
* Refactor configuration [Part II].
* General changes:
** Remove `Init` methods to avoid ambiguity.
** Remove `Configure(std::map<>)` to avoid redundant copying and prepare for
parameter validation. (`std::vector` is returned from `InitAllowUnknown`).
** Add name to tree updaters for easier debugging.
* Learner changes:
** Make `LearnerImpl` the only source of configuration.
All configurations are stored and carried out by `LearnerImpl::Configure()`.
** Remove booster in C API.
Originally kept for "compatibility reason", but did not state why. So here
we just remove it.
** Add a `metric_names_` field in `LearnerImpl`.
** Remove `LazyInit`. Configuration will always be lazy.
** Run `Configure` before every iteration.
* Predictor changes:
** Allocate both cpu and gpu predictor.
** Remove cpu_predictor from gpu_predictor.
`GBTree` is now used to dispatch the predictor.
** Remove some GPU Predictor tests.
* IO
No IO changes. The binary model format stability is tested by comparing
hashing value of save models between two commits
* bump scala to 2.12 which requires java 8 and also newer flink and akka
* put scala version in artifactId
* fix appveyor
* fix for scaladoc issue that looks like https://github.com/scala/bug/issues/10509
* fix ci_build
* update versions in generate_pom.py
* fix generate_pom.py
* apache does not have a download for spark 2.4.3 distro using scala 2.12 yet, so for now i use a tgz i put on s3
* Upload spark-2.4.3-bin-scala2.12-hadoop2.7.tgz to our own S3
* Update Dockerfile.jvm_cross
* Update Dockerfile.jvm_cross
* Reorganize contributor's doc
* Address comments from @trivialfis
* Address @sriramch's comment: include ABI compatibility guarantee
* Address @rongou's comment
* Postpone ABI compatibility guarantee for now
* provide the readme
* update for format
* reformat
* reformat -2
* update again
* update format
* update w.r.t yinlou's comments
* Add kubernetes tutorial to Table of Contents
* Style edit
* Fix#4630, #4421: Preserve correct ordering between metrics, and always use last metric for early stopping
* Clarify semantics of early stopping in presence of multiple valid sets and metrics
* Add a test
* Fix lint
* _maybe_pandas_xxx should return their arguments unchanged if no pandas installed
* Tests should not assume pandas is installed
* Mark tests which require pandas as such
* Fix external memory for get column batches.
This fixes two bugs:
* Use PushCSC for get column batches.
* Don't remove the created temporary directory before finishing test.
* Check all pages.
* Add to documentation how to build native unit tests
* Add instructions to run Python tests and to use Docker container [skip ci]
* Fix link to pytest chapter
* Add link to Google Test [skip ci]
* Set PYTHONPATH [skip ci]
* Revise test_python.sh for running tests locally
* Update test_python.sh
* Place Docker recommendation notice in a prominent place [skip ci]
* Initial performance optimizations for xgboost
* remove includes
* revert float->double
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* Check existence of _mm_prefetch and __builtin_prefetch
* Fix lint
* optimizations for CPU
* appling comments in review
* add some comments, code refactoring
* fixing issues in CI
* adding runtime checks
* remove 1 extra check
* remove extra checks in BuildHist
* remove checks
* add debug info
* added debug info
* revert changes
* added comments
* Apply suggestions from code review
Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* apply review comments
* Remove unused function CreateNewNodes()
* Add descriptive comment on node_idx variable in QuantileHistMaker::Builder::BuildHistsBatch()
* Implement tree model dump with a code generator.
* Split up generators.
* Implement graphviz generator.
* Use pattern matching.
* [Breaking] Return a Source in `to_graphviz` instead of Digraph in Python package.
Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* - do not create device vectors for the entire sparse page while computing histograms...
- while creating the compressed histogram indices, the row vector is created for the entire
sparse page batch. this is needless as we only process chunks at a time based on a slice
of the total gpu memory
- this pr will allocate only as much as required to store the ppropriate row indices and the entries
* - do not dereference row_ptrs once the device_vector has been created to elide host copies of those counts
- instead, grab the entry counts directly from the sparsepage
* - set the appropriate device before freeing device memory...
- pr #4532 added a global memory tracker/logger to keep track of number of (de)allocations
and peak memory usage on a per device basis.
- this pr adds the appropriate check to make sure that the (de)allocation counts and memory usages
makes sense for the device. since verbosity is typically increased on debug/non-retail builds.
* - pre-create cub allocators and reuse them
- create them once and not resize them dynamically. we need to ensure that these allocators
are created and destroyed exactly once so that the appropriate device id's are set
This is part 1 of refactoring configuration.
* Move tree heuristic configurations.
* Split up declarations and definitions for GBTree.
* Implement UseGPU in gbm.
* - training with external memory - part 2 of 2
- when external memory support is enabled, building of histogram indices are
done incrementally for every sparse page
- the entire set of input data is divided across multiple gpu's and the relative
row positions within each device is tracked when building the compressed histogram buffer
- this was tested using a mortgage dataset containing ~ 670m rows before 4xt4's could be
saturated
* Fix C++11 config parser
* Use raw strings to improve readability of regex
* Fix compilation for GCC 5.x
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
* simplify the config.h file
* revise config.h
* revised config.h
* revise format
* revise format issues
* revise whitespace issues
* revise whitespace namespace format issues
* revise namespace format issues
* format issues
* format issues
* format issues
* format issues
* Revert submodule changes
* minor change
* Update src/common/config.h
Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* address format issue from trivialfis
* Use correct cub submodule
* - training with external memory part 1 of 2
- this pr focuses on computing the quantiles using multiple gpus on a
dataset that uses the external cache capabilities
- there will a follow-up pr soon after this that will support creation
of histogram indices on large dataset as well
- both of these changes are required to support training with external memory
- the sparse pages in dmatrix are taken in batches and the the cut matrices
are incrementally built
- also snuck in some (perf) changes related to sketches aggregation amongst multiple
features across multiple sparse page batches. instead of aggregating the summary
inside each device and merged later, it is aggregated in-place when the device
is working on different rows but the same feature
* Only define `gpu_id` and `n_gpus` in `LearnerTrainParam`
* Pass LearnerTrainParam through XGBoost vid factory method.
* Disable all GPU usage when GPU related parameters are not specified (fixes XGBoost choosing GPU over aggressively).
* Test learner train param io.
* Fix gpu pickling.
* - fix issues with training with external memory on cpu
- use the batch size to determine the correct number of rows in a batch
- use the right number of threads in omp parallalization if the batch size
is less than the default omp max threads (applicable for the last batch)
* - handle scenarios where last batch size is < available number of threads
- augment tests such that we can test all scenarios (batch size <, >, = number of threads)
* adding support for matrix slicing with query ID for cross-validation
* hail mary test of unrar installation for windows tests
* trying to modify tests to run in Github CI
* Remove dependency on wget and unrar
* Save error log from R test
* Relax assertion in test_training
* Use int instead of bool in C function interface
* Revise R interface
* Add XGDMatrixSliceDMatrixEx and keep old XGDMatrixSliceDMatrix for API compatibility
* Add CMake option to use bundled gtest from dmlc-core, so that it is easy to build XGBoost with gtest on Windows
* Consistently apply OpenMP flag to all targets. Force enable OpenMP when USE_CUDA is turned on.
* Insert vcomp140.dll into Windows wheels
* Add C++ and Python tests for CPU and GPU targets (CUDA 9.0, 10.0, 10.1)
* Prevent spurious msbuild failure
* Add GPU tests
* Upgrade dmlc-core
* Fix#4462: Use /MT flag consistently for MSVC target
* First attempt at Windows CI
* Distinguish stages in Linux and Windows pipelines
* Try running CMake in Windows pipeline
* Add build step
* Automatically set maximize_evaluation_metrics if not explicitly given.
* When custom_eval is set, require maximize_evaluation_metrics.
* Update documents on early stop in XGBoost4J-Spark.
* Fix code error.
* Make CMakeLists.txt compatible with CMake 3.3; require CMake 3.11 for MSVC
* Use CMake 3.12 when sanitizer is enabled
* Disable funroll-loops for MSVC
* Use cmake version in container name
* Add missing arg
* Fix egrep use in ci_build.sh
* Display CMake version
* Do not set OpenMP_CXX_LIBRARIES for MSVC
* Use cmake_minimum_required()
* Use feature interaction constraints to narrow search space for split candidates.
* fix clang-tidy broken at updater_quantile_hist.cc:535:3
* make const
* fix
* try to fix exception thrown in java_test
* fix suspected mistake which cause EvaluateSplit error
* try fix
* Fix bug: feature ID and node ID swapped in argument
* Rename CheckValidation() to CheckFeatureConstraint() for clarity
* Do not create temporary vector validFeatures, to enable parallelism
* Combine thread launches into single launch per tree for gpu_hist
algorithm.
* Address deprecation warning
* Add manual column sampler constructor
* Turn off omp dynamic to get a guaranteed number of threads
* Enable openmp in cuda code
* All Linux tests are now in Jenkins CI
* Tests are now de-coupled from builds. We can now build XGBoost with one version of CUDA/JDK and test it with another version of CUDA/JDK
* Builds (compilation) are significantly faster because 1) They use C5 instances with faster CPU cores; and 2) build environment setup is cached using Docker containers
* fix the nan and non-zero missing value handling
* fix nan handling part
* add missing value
* Update MissingValueHandlingSuite.scala
* Update MissingValueHandlingSuite.scala
* stylistic fix
* [jvm-packages][hot-fix] fix column mismatch caused by zip actions at XGBooostModel.transformInternal
* apply minibatch in prediction
* an iterator-compatible minibatch prediction
* regressor impl
* continuous working on mini-batch prediction of xgboost4j-spark
* Update Booster.java
* Refactor CMake scripts.
* Remove CMake CUDA wrapper.
* Bump CMake version for CUDA.
* Use CMake to handle Doxygen.
* Split up CMakeList.
* Export install target.
* Use modern CMake.
* Remove build.sh
* Workaround for gpu_hist test.
* Use cmake 3.12.
* Revert machine.conf.
* Move CLI test to gpu.
* Small cleanup.
* Support using XGBoost as submodule.
* Fix windows
* Fix cpp tests on Windows
* Remove duplicated find_package.
* [r-package] cut CI-time dependency on craigcitro/r-travis (fixes#4348)
* Install R
* Install R on OSX
* Remove gfortran symlink
* Specify CRAN repo
* added more R dependencies needed for testing
* removed heavy R dependencies in CI
* fixed bug in env var, removed unnecessary apt installs of R
* fix to R installs
The old NativeLibLoader had a short-circuit load path which modified
java.library.path and attempted to load the xgboost library from outside
the jar first, falling back to loading the library from inside the jar.
This path is a no-op every time when using XGBoost outside of it's
source tree. Additionally it triggers an illegal reflective access
warning in the module system in 9, 10, and 11.
On Java 12 the ClassLoader fields are not accessible via reflection
(separately from the illegal reflective acces warning), and so it fails
in a way that isn't caught by the code which falls back to loading the
library from inside the jar.
This commit removes that code path and always loads the xgboost library
from inside the jar file as it's a valid technique across multiple JVM
implementations and works with all versions of Java.
* Fix Histogram allocation.
nidx_map is cleared after `Reset`, but histogram data size isn't changed hence
histogram recycling is used in later iterations. After a reset(building new
tree), newly allocated node will start from 0, while recycling always choose
the node with smallest index, which happens to be our newly allocated node 0.
* When building pull requests, use Docker cache for master branch
Docker build caches are per-branch, so new pull requests will initially
have no build cache, causing the Docker containers to be built from
scratch. New pull requests should use the cache associated with the
master branch. This makes sense, since most pull requests do not modify
the Dockerfile.
* Add comments
* make the assignments of HostDeviceVector exception safe.
* storing a dummy GPUDistribution instance in HDV for CPU based code.
* change testxgboost binary location to build directory.
* Make train in xgboost4j respect print params
Previously no setting in params argument of Booster::train would prevent
the Rabit.trackerPrint call. This can fill up a lot of screen space in
the case that many folds are being trained.
* Setting "silent" in this map to "true", "True", a non-zero integer, or
a string that can be parsed to such an int will prevent printing.
* Setting "verbose_eval" to "False" or "false" will prevent printing.
* Setting "verbose_eval" to an int (or a String parseable to an int) n
will result in printing every n steps, or no printing is n is zero.
This is to match the python behaviour described here:
https://www.kaggle.com/c/rossmann-store-sales/discussion/17499
* Fixed 'slient' typo in xgboost4j test
* private access on two methods
* Optimisations for gpu_hist.
* Use streams to overlap operations.
* ColumnSampler now uses HostDeviceVector to prevent repeatedly copying feature vectors to the device.
* Brought the silent parameter for the SKLearn-like API back, marked it deprecated.
- added deprecation notice and warning
- removed silent from the tests for the SKLearn-like API
* Improved multi-node multi-GPU random forests.
- removed rabit::Broadcast() from each invocation of column sampling
- instead, syncing the PRNG seed when a ColumnSampler() object is constructed
- this makes non-trivial column sampling significantly faster in the distributed case
- refactored distributed GPU tests
- added distributed random forests tests
* Upgrade gtest for clang-tidy.
* Use CMake to install GTest instead of mv.
* Don't enforce clang-tidy to return 0 due to errors in thrust.
* Add a small test for tidy itself.
* Reformat.
* Added SKLearn-like random forest Python API.
- added XGBRFClassifier and XGBRFRegressor classes to SKL-like xgboost API
- also added n_gpus and gpu_id parameters to SKL classes
- added documentation describing how to use xgboost for random forests,
as well as existing caveats
* Initial commit to support multi-node multi-gpu xgboost using dask
* Fixed NCCL initialization by not ignoring the opg parameter.
- it now crashes on NCCL initialization, but at least we're attempting it properly
* At the root node, perform a rabit::Allreduce to get initial sum_gradient across workers
* Synchronizing in a couple of more places.
- now the workers don't go down, but just hang
- no more "wild" values of gradients
- probably needs syncing in more places
* Added another missing max-allreduce operation inside BuildHistLeftRight
* Removed unnecessary collective operations.
* Simplified rabit::Allreduce() sync of gradient sums.
* Removed unnecessary rabit syncs around ncclAllReduce.
- this improves performance _significantly_ (7x faster for overall training,
20x faster for xgboost proper)
* pulling in latest xgboost
* removing changes to updater_quantile_hist.cc
* changing use_nccl_opg initialization, removing unnecessary if statements
* added definition for opaque ncclUniqueId struct to properly encapsulate GetUniqueId
* placing struct defintion in guard to avoid duplicate code errors
* addressing linting errors
* removing
* removing additional arguments to AllReduer initialization
* removing distributed flag
* making comm init symmetric
* removing distributed flag
* changing ncclCommInit to support multiple modalities
* fix indenting
* updating ncclCommInitRank block with necessary group calls
* fix indenting
* adding print statement, and updating accessor in vector
* improving print statement to end-line
* generalizing nccl_rank construction using rabit
* assume device_ordinals is the same for every node
* test, assume device_ordinals is identical for all nodes
* test, assume device_ordinals is unique for all nodes
* changing names of offset variable to be more descriptive, editing indenting
* wrapping ncclUniqueId GetUniqueId() and aesthetic changes
* adding synchronization, and tests for distributed
* adding to tests
* fixing broken #endif
* fixing initialization of gpu histograms, correcting errors in tests
* adding to contributors list
* adding distributed tests to jenkins
* fixing bad path in distributed test
* debugging
* adding kubernetes for distributed tests
* adding proper import for OrderedDict
* adding urllib3==1.22 to address ordered_dict import error
* added sleep to allow workers to save their models for comparison
* adding name to GPU contributors under docs
* Fix early stop with xgboost4j-spark
* Update XGBoost.java
* Update XGBoost.java
* Update XGBoost.java
To use -Float.MAX_VALUE as the lower bound, in case there is positive metric.
* Only update best score if the current score is better (no update when equal)
* Update xgboost-spark tutorial to fix early stopping docs.
* Fix test_gpu_coordinate.
* Use `gpu_coord_descent` in test.
* Reduce number of running rounds.
* Remove nthread.
* Use githubusercontent for r-appveyor.
* Use githubusercontent in travis r tests.
* Prevent empty quantiles
* Revise and improve unit tests for quantile hist
* Remove unnecessary comment
* Add #2943 as a test case
* Skip test if no sklearn
* Revise misleading comments
* Add checks for group size.
* Simple docs.
* Search group index during hist cut matrix initialization.
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* Fix broken R test: Install Homebrew GCC
Missing GCC Fortran causes installation failure of a dependency package
(igraph)
* Register gfortran system-wide
* Use correct keg
* Set env vars to change compiler choice
* Do not break other Mac builds
* Nuclear option: symlink gfortran
* Use /usr/local/bin instead of /usr/bin
* Symlink library path too
* Update run_test.sh
* Remove GHistRow, GHistEntry, GHistIndexRow.
* Remove kSimpleStats.
* Remove CheckInfo, SetLeafVec in GradStats and in SKStats.
* Clean up the GradStats.
* Cleanup calcgain.
* Move LossChangeMissing out of common.
* Remove [] operator from GHistIndexBlock.
* Basic script for using compilation database.
* Add `GENERATE_COMPILATION_DATABASE' to CMake.
* Rearrange CMakeLists.txt.
* Add basic python clang-tidy script.
* Remove modernize-use-auto.
* Add clang-tidy to Jenkins
* Refine logic for correct path detection
In Jenkins, the project root is of form /home/ubuntu/workspace/xgboost_PR-XXXX
* Run clang-tidy in CUDA 9.2 container
* Use clang_tidy container
* Enable xgb_model parameter in XGClassifier scikit-learn API
https://github.com/dmlc/xgboost/issues/3049
* add test_XGBClassifier_resume():
test for xgb_model parameter in XGBClassifier API.
* Update test_with_sklearn.py
* Fix lint
* Fix failing Travis CI on Mac
Use Homebrew Addon + latest Mac image
* Use long command for pytest
* Downgrade OSX image to xcode9.3, to use Java 8
* Install pytest in Python 2 environment
* Remove clang-tidy from Travis
- ./testxgboost (without filters) failed if run on a multi-GPU machine because
the memory was allocated on the current device, but device 0
was always passed into LaunchN
* Initial performance optimizations for xgboost
* remove includes
* revert float->double
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* fix for CI
* Check existence of _mm_prefetch and __builtin_prefetch
* Fix lint
* Updates to Booster to support other feature importances
* Add returns for Java methods
* Pass Scala style checks
* Pass Java style checks
* Fix indents
* Use class instead of enum
* Return map string double
* A no longer broken build, thanks to mvn package local build
* Add a unit test to increase code coverage back
* Address code review on main code
* Add more unit tests for different feature importance scores
* Address more CR
* Use Span in gpu coordinate.
* Use Span in device code.
* Fix shard size calculation.
- Use lower_bound instead of upper_bound.
* Check empty devices.
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* wrap iterators
* enable copartition training and validationset
* add parameters
* converge code path and have init unit test
* enable multi evals for ranking
* unit test and doc
* update example
* fix early stopping
* address the offline comments
* udpate doc
* test eval metrics
* fix compilation issue
* fix example
* Unify logging facilities.
* Enhance `ConsoleLogger` to handle different verbosity.
* Override macros from `dmlc`.
* Don't use specialized gamma when building with GPU.
* Remove verbosity cache in monitor.
* Test monitor.
* Deprecate `silent`.
* Fix doc and messages.
* Fix python test.
* Fix silent tests.
* Ensure lists cannot be passed into DMatrix
The documentation does not include lists as an allowed type for the data inputted into DMatrix. Despite this, a list can be passed in without an error. This change would prevent a list form being passed in directly.
* update description of early stopping rounds
the description of early stopping round was quite inconsistent in the scikit-learn api section since the fit paragraph tells that when early stopping rounds occurs, the last iteration is returned not the best one, but the predict paragraph tells that when the predict is called without ntree_limit specified, then ntree_limit is equals to best_ntree_limit.
Thus, when reading the fit part, one could think that it is needed to specify what is the best iter when calling the predict, but when reading the predict part, then the best iter is given by default, it is the last iter that you have to specify if needed.
* Update sklearn.py
* Update sklearn.py
fix doc according to the python_lightweight_test error
* Port elementwise metrics to GPU.
* All elementwise metrics are converted to static polymorphic.
* Create a reducer for metrics reduction.
* Remove const of Metric::Eval to accommodate CubMemory.
- Improved GPU performance logging
- Only use one execute shards function
- Revert performance regression on multi-GPU
- Use threads to launch NCCL AllReduce
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* update version
* 0.82
* fix early stopping condition
* remove unused
* update comments
* udpate comments
* update test
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* update version
* 0.82
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* wrap iterators
* remove unused code
* refactor
* fix typo
* use gain for sklearn feature_importances_
`gain` is a better feature importance criteria than the currently used `weight`
* added importance_type to class
* fixed test
* white space
* fix variable name
* fix deprecation warning
* fix exp array
* white spaces
* Enable running objectives with 0 GPU.
* Enable 0 GPU for objectives.
* Add doc for GPU objectives.
* Fix some objectives defaulted to running on all GPUs.
* Make C++ unit tests run and pass on Windows
* Fix logic for external memory. The letter ':' is part of drive letter,
so remove the drive letter before splitting on ':'.
* Cosmetic syntax changes to keep MSVC happy.
* Fix lint
* Add Windows guard
* Fix#3342 and h2oai/h2o4gpu#625: Save predictor parameters in model file
This allows pickled models to retain predictor attributes, such as
'predictor' (whether to use CPU or GPU) and 'n_gpu' (number of GPUs
to use). Related: h2oai/h2o4gpu#625Closes#3342.
TODO. Write a test.
* Fix lint
* Do not load GPU predictor into CPU-only XGBoost
* Add a test for pickling GPU predictors
* Make sample data big enough to pass multi GPU test
* Update test_gpu_predictor.cu
* Fix#3747: Add coef_ and intercept_ as properties of sklearn wrapper
Scikit-learn expects linear learners to expose `coef_` and `intercept_`
as properties.
Closes#3747.
* Fix lint
* Clean up logic for converting tree_method to updater sequence
* Use C++11 enum class for extra safety
Compiler will give warnings if switch statements don't handle all
possible values of C++11 enum class.
Also allow enum class to be used as DMLC parameter.
* Fix compiler error + lint
* Address reviewer comment
* Better docstring for DECLARE_FIELD_ENUM_CLASS
* Fix lint
* Add C++ test to see if tree_method is recognized
* Fix clang-tidy error
* Add test_learner.h to R package
* Update comments
* Fix lint error
* fix error in dmlc#57, clean up comments and naming
* include missing packages, disable recovery tests for now
* disable local_recover tests until we have a bug fix
* support larger cluster
* fix lint, merge with master
* fix mac osx test failure in https://github.com/dmlc/xgboost/pull/3818
* Update allreduce_robust.cc
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* documenting tracker
* Make it a separate note
The `save_model()` and `load_model()` method only saves the part of the model
that's common to all language interfaces and do not preserve Python-specific
attributes, such as `feature_names`. More crucially, label encoder is not
preserved either; this is needed for the scikit-learn wrapper, since you may
have string labels.
Fix: Explicitly recommend pickling as the way to save scikit-learn model
objects.
* Multi-GPU support in GPUPredictor.
- GPUPredictor is multi-GPU
- removed DeviceMatrix, as it has been made obsolete by using HostDeviceVector in DMatrix
* Replaced pointers with spans in GPUPredictor.
* Added a multi-GPU predictor test.
* Fix multi-gpu test.
* Fix n_rows < n_gpus.
* Reinitialize shards when GPUSet is changed.
* Tests range of data.
* Remove commented code.
* Remove commented code.
* Enable auto-locking of issues closed long ago
Issues that were closed more than 90 days ago will be locked automatically so
that no additional comments would be allowed. We will use a bot to do
this: https://probot.github.io/apps/lock/
Background: As a maintainer, I often see people leaving comments to old issue
posts that were closed long ago. Those comments are hard to discover and assist
with, since they get buried under list of other active issues.
With the change, users who want to follow up with an old issue would be asked
to file a new issue.
* Exempt `feature-request` from auto locking
* Disable comment to avoid triggering notification
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* temp
* add method for classifier and regressor
* update tutorial
* address the comments
* update
A privilege escalation vulnerability (CVE-2017-15288) has been
identified in the Scala compilation daemon. See
https://nvd.nist.gov/vuln/detail/CVE-2017-15288
Fix: Upgrade Scala to 2.11.12.
* fix error in dmlc#57, clean up comments and naming
* include missing packages, disable recovery tests for now
* disable local_recover tests until we have a bug fix
* support larger cluster
* fix lint, merge with master
**Symptom** Apple Clang's implementation of `std::shuffle` expects doesn't work
correctly when it is run with the random bit generator for R package:
```cpp
CustomGlobalRandomEngine::result_type
CustomGlobalRandomEngine::operator()() {
return static_cast<result_type>(
std::floor(unif_rand() * CustomGlobalRandomEngine::max()));
}
```
Minimial reproduction of failure (compile using Apple Clang 10.0):
```cpp
std::vector<int> feature_set(100);
std::iota(feature_set.begin(), feature_set.end(), 0);
// initialize with 0, 1, 2, 3, ..., 99
std::shuffle(feature_set.begin(), feature_set.end(), common::GlobalRandom());
// This returns 0, 1, 2, ..., 99, so content didn't get shuffled at all!!!
```
Note that this bug is platform-dependent; it does not appear when GCC or
upstream LLVM Clang is used.
**Diagnosis** Apple Clang's `std::shuffle` expects 32-bit integer
inputs, whereas `CustomGlobalRandomEngine::operator()` produces 64-bit
integers.
**Fix** Have `CustomGlobalRandomEngine::operator()` produce 32-bit integers.
Closes#3523.
* Split building histogram into separated class.
* Extract `InitCompressedRow` definition.
* Basic tests for gpu-hist.
* Document the code more verbosely.
* Removed `HistCutUnit`.
* Removed some duplicated copies in `GPUHistMaker`.
* Implement LCG and use it in tests.
* Added some instructions on using MinGW-built XGBoost with python.
* Changes according to the discussion and some additions
* Fixed wording and removed redundancy.
* Even more fixes
* Fixed links. Removed redundancy.
* Some fixes according to the discussion
* fixes
* Some fixes
* fixes
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* sparjJobThread
* update
* fix issue when spark job execution thread cannot return before we execute first()
* Implement Transform class.
* Add tests for softmax.
* Use Transform in regression, softmax and hinge objectives, except for Cox.
* Mark old gpu objective functions deprecated.
* static_assert for softmax.
* Split up multi-gpu tests.
* DMatrix refactor 2
* Remove buffered rowset usage where possible
* Transition to c++11 style iterators for row access
* Transition column iterators to C++ 11
* Add multi-GPU unit test environment
* Better assertion message
* Temporarily disable failing test
* Distinguish between multi-GPU and single-GPU CPP tests
* Consolidate Python tests. Use attributes to distinguish multi-GPU Python tests from single-CPU counterparts
* Fix#3730: scikit-learn 0.20 compatibility fix
sklearn.cross_validation has been removed from scikit-learn 0.20,
so replace it with sklearn.model_selection
* Display test names for Python tests for clarity
* add back train method but mark as deprecated
* fix scalastyle error
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* fix scalastyle error
* instrumentation
* use log console
* better measurement
* fix erros in example
* update histmaker
* add a demo of multi-class classification R version
* add a demo of multi-class classification result
* add intro to the demo readme
* Delete train.md
* Update README.md
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* remove copy paste error
* added test, commented out right now
* reinstated test
* added fix for checking encryption settings
* fix by using RDD conf
* fix compilation
* renamed conf
* use SparkSession if available
* fix message
* nop
* code review fixes
* Fix#3397: early_stop callback does not maximize metric of form NDCG@n-
Early stopping callback makes splits with '-' letter, which interferes
with metrics of form NDCG@n-. As a result, XGBoost tries to minimize
NDCG@n-, where it should be maximized instead.
Fix. Specify maxsplit=1.
* Python 2.x compatibility fix
* Add scikit-learn tests
Goal is to pass scikit-learn's check_estimator() for XGBClassifier,
XGBRegressor, and XGBRanker. It is actually not possible to do so
entirely, since check_estimator() assumes that NaN is disallowed,
but XGBoost allows for NaN as missing values. However, it is always
good ideas to add some checks inspired by check_estimator().
* Fix lint
* Fix lint
* add interaction constraints
* enable both interaction and monotonic constraints at the same time
* fix lint
* add R test, fix lint, update demo
* Use dmlc::JSONReader to express interaction constraints as nested lists; Use sparse arrays for bookkeeping
* Add Python test for interaction constraints
* make R interaction constraints parameter based on feature index instead of column names, fix R coding style
* Fix lint
* Add BlueTea88 to CONTRIBUTORS.md
* Short circuit when no constraint is specified; address review comments
* Add tutorial for feature interaction constraints
* allow interaction constraints to be passed as string, remove redundant column_names argument
* Fix typo
* Address review comments
* Add comments to Python test
- previously, vec_ in DeviceShard wasn't updated on copy; as a result,
the shards continued to refer to the old HostDeviceVectorImpl object,
which resulted in a dangling pointer once that object was deallocated
* Fix#3648: XGBClassifier.predict() should return margin scores when output_margin=True
* Fix tests to reflect correct implementation of XGBClassifier.predict(output_margin=True)
* Fix flaky test test_with_sklearn.test_sklearn_api_gblinear
* Replaced std::vector with HostDeviceVector in MetaInfo and SparsePage.
- added distributions to HostDeviceVector
- using HostDeviceVector for labels, weights and base margings in MetaInfo
- using HostDeviceVector for offset and data in SparsePage
- other necessary refactoring
* Added const version of HostDeviceVector API calls.
- const versions added to calls that can trigger data transfers, e.g. DevicePointer()
- updated the code that uses HostDeviceVector
- objective functions now accept const HostDeviceVector<bst_float>& for predictions
* Updated src/linear/updater_gpu_coordinate.cu.
* Added read-only state for HostDeviceVector sync.
- this means no copies are performed if both host and devices access
the HostDeviceVector read-only
* Fixed linter and test errors.
- updated the lz4 plugin
- added ConstDeviceSpan to HostDeviceVector
- using device % dh::NVisibleDevices() for the physical device number,
e.g. in calls to cudaSetDevice()
* Fixed explicit template instantiation errors for HostDeviceVector.
- replaced HostDeviceVector<unsigned int> with HostDeviceVector<int>
* Fixed HostDeviceVector tests that require multiple GPUs.
- added a mock set device handler; when set, it is called instead of cudaSetDevice()
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* interrupted exception is not rethrown
* Add XGBRanker to Python API doc
* Show inherited members of XGBRegressor in API doc, since XGBRegressor uses default methods from XGBModel
* Add table of contents to Python API doc
* Skip JVM doc download if not available
* Show inherited members for XGBRegressor and XGBRanker
* Expose XGBRanker to Python XGBoost module directory
* Add docstring to XGBRegressor.predict() and XGBRanker.predict()
* Fix rendering errors in Python docstrings
* Fix lint
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* fix update checkpoint func
* added xgbranker
* fixed predict method and ranking test
* reformatted code in accordance with pep8
* fixed lint error
* fixed docstring and added checks on objective
* added ranking demo for python
* fixed suffix in rank.py
* Add basic Span class based on ISO++20.
* Use Span<Entry const> instead of Inst in SparsePage.
* Add DeviceSpan in HostDeviceVector, use it in regression obj.
This pull request amends the broken #3062 allow Spark 2.2 to work.
Please note this won't work in Spark <=2.1 as sc.removeSparkListener was implemented in Spark 2.2. (So perhaps a more general method is better, although that is what was attempted in #3062)
This PR fixes: #3208, #3151 and the discussion in #1927.
I do find it strange that #3062 dose not work in Spark 2.2, it's probably due to some sort of public/private issue in the org.apache.spark.scheduler.LiveListenerBus class inheritance (In Spark itself). The error is: `java.lang.NoSuchMethodError: org.apache.spark.scheduler.LiveListenerBus.removeListener(Ljava/lang/Object;)V`
* Adding Java/Scala doc build to Jenkins CI
* Deploy built doc to S3 bucket
* Build doc only for branches
* Build doc first, to get doc faster for branch updates
* Have ReadTheDocs download doc tarball from S3
* Update JVM doc links
* Put doc build commands in a script
* Specify Spark 2.3+ requirement for XGBoost4J-Spark
* Build GPU wheel without NCCL, to reduce binary size
* Revert "Fix #3485, #3540: Don't use dropout for predicting test sets (#3556)"
This reverts commit 44811f2330.
* Document behavior of predict() for DART booster
* Add notice to parameter.rst
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* partial finish
* no test
* add test cases
* add test cases
* address comments
* add test for regressor
* fix typo
* Fix#3545: XGDMatrixCreateFromCSCEx silently discards empty trailing rows
Description: The bug is triggered when
1. The data matrix has empty rows at the bottom. More precisely, the rows
`n-k+1`, `n-k+2`, ..., `n` of the matrix have missing values in all
dimensions (`n` number of instances, `k` number of trailing rows)
2. The data matrix is given as Compressed Sparse Column (CSC) format.
Diagnosis: When the CSC matrix is converted to Compressed Sparse Row (CSR)
format (this is common format used for DMatrix), the trailing empty rows
are silently ignored. More specifically, the row pointer (`offset`) of the
newly created CSR matrix does not take account of these rows.
Fix: Modify the row pointer.
* Add regression test
The base margin will need to have length `[num_class] * [number of data points]`.
Otherwise, the array holding prediction results will be only partially
initialized, causing undefined behavior.
Fix: check the length of the base margin. If the length is not correct,
use the global bias (`base_score`) instead. Warn the user about the
substitution.
* Fix#3402: wrong fid crashes distributed algorithm
The bug was introduced by the recent DMatrix refactor (#3301). It was partially
fixed by #3408 but the example in #3402 was still failing. The example in #3402
will succeed after this fix is applied.
* Explicitly specify "this" to prevent compile error
* Add regression test
* Add distributed test to Travis matrix
* Install kubernetes Python package as dependency of dmlc tracker
* Add Python dependencies
* Add compile step
* Reduce size of regression test case
* Further reduce size of test
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* add new
* update doc
* finish Gang Scheduling
* more
* intro
* Add sections: Prediction, Model persistence and ML pipeline.
* Add XGBoost4j-Spark MLlib pipeline example
* partial finished version
* finish the doc
* adjust code
* fix the doc
* use rst
* Convert XGBoost4J-Spark tutorial to reST
* Bring XGBoost4J up to date
* add note about using hdfs
* remove duplicate file
* fix descriptions
* update doc
* Wrap HDFS/S3 export support as a note
* update
* wrap indexing_mode example in code block
This bring many goodies, including:
* Ability to specify delimiter and weight_column for CSV files:
```python
dtrain = xgboost.DMatrix('train.csv?format=csv&label_column=0&weight_column=1&delimiter= ')
```
* Ability to choose between 0-based and 1-based indexing for LIBSVM/LIBFM files:
```python
dtrain = xgboost.DMatrix('train.libsvm?indexing_mode=1') # use 1-based indexing
dtest = xgboost.DMatrix('test.libsvm') # use 0-based indexing (default)
dtest2 = xgboost.DMatrix('test2.libsvm?indexing_mode=-1') # use heuristic to detect 0-based / 1-based
```
* Fix a bug in float parsing (issue dmlc/dmlc-core#440)
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* consider spark.task.cpus when controlling parallelism
* fix bug
* fix conf setup
* calculate requestedCores within ParallelismController
* enforce spark.task.cpus = 1
* unify unit test case framework
* enable spark ui
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* consider missing value in prediction
* handle single prediction instance
* fix type conversion
* Fix bug of using list(x) function when x is string
list('abcdcba') = ['a', 'b', 'c', 'd', 'c', 'b', 'a']
* Allow feature_names/feature_types to be of any type
If feature_names/feature_types is iterable, e.g. tuple, list, then convert the value to list, except for string; otherwise construct a list with a single value
* Delete excess whitespace
* Fix whitespace to pass lint
* Expand histogram memory dynamically to prevent large allocations for large tree depths (e.g. > 15)
* Remove GPU memory allocation messages. These are misleading as a large number of allocations are now dynamic.
* Fix appveyor R test
* Save max_delta_step as an extra attribute of Booster
Fixes#3509 and #3026, where `max_delta_step` parameter gets lost during serialization.
* fix lint
* Use camel case for global constant
* disable local variable case in clang-tidy
* Added finding quantiles on GPU.
- this includes datasets where weights are assigned to data rows
- as the quantiles found by the new algorithm are not the same
as those found by the old one, test thresholds in
tests/python-gpu/test_gpu_updaters.py have been adjusted.
* Adjustments and improved testing for finding quantiles on the GPU.
- added C++ tests for the DeviceSketch() function
- reduced one of the thresholds in test_gpu_updaters.py
- adjusted the cuts found by the find_cuts_k kernel
Add `'total_gain'` and `'total_cover'` as possible `importance_type`
arguments to `Booster.get_score` in the Python package.
`get_score` already accepts a `'gain'` argument, which returns each
feature's average gain over all of its splits. `'total_gain'` does the
same, but returns a total rather than an average. This seems more
intuitively meaningful, and also matches the behavior of the R package's
`xgb.importance` function.
I also added an analogous `'total_cover'` command for consistency.
This should resolve#3484.
* Improved library loading a bit
* Fixed indentation.
* Fixes according to the discussion
* Moved the comment to a separate line.
* specified exception type
* Change doc build to reST exclusively
* Rewrite Intro doc in reST; create toctree
* Update parameter and contribute
* Convert tutorials to reST
* Convert Python tutorials to reST
* Convert CLI and Julia docs to reST
* Enable markdown for R vignettes
* Done migrating to reST
* Add guzzle_sphinx_theme to requirements
* Add breathe to requirements
* Fix search bar
* Add link to user forum
* Fail GPU CI after test failure
* Fix GPU linear tests
* Reduced number of GPU tests to speed up CI
* Remove static allocations of device memory
* Resolve illegal memory access for updater_fast_hist.cc
* Fix broken r tests dependency
* Update python install documentation for GPU
* Upgrading to NCCL2
* Part - II of NCCL2 upgradation
- Doc updates to build with nccl2
- Dockerfile.gpu update for a correct CI build with nccl2
- Updated FindNccl package to have env-var NCCL_ROOT to take precedence
* Upgrading to v9.2 for CI workflow, since it has the nccl2 binaries available
* Added NCCL2 license + copy the nccl binaries into /usr location for the FindNccl module to find
* Set LD_LIBRARY_PATH variable to pick nccl2 binary at runtime
* Need the nccl2 library download instructions inside Dockerfile.release as well
* Use NCCL2 as a static library
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* disable booster setup in spark
* check in parameter conversion
* fix compilation issue
* update exception type
* add qid for https://github.com/dmlc/xgboost/issues/2748
* change names
* change spaces
* change qid to bst_uint type
* change qid type to size_t
* change qid first to SIZE_MAX
* change qid type from size_t to uint64_t
* update dmlc-core
* fix qids name error
* fix group_ptr_ error
* Style fix
* Add qid handling logic to SparsePage
* New MetaInfo format + backward compatibility fix
Old MetaInfo format (1.0) doesn't contain qid field. We still want to be able
to read from MetaInfo files saved in old format. Also, define a new format
(2.0) that contains the qid field. This way, we can distinguish files that
contain qid and those that do not.
* Update MetaInfo test
* Simply group assignment logic
* Explicitly set qid=nullptr in NativeDataIter
NativeDataIter's callback does not support qid field. Users of NativeDataIter
will need to call setGroup() function separately to set group information.
* Save qids_ in SaveBinary()
* Upgrade dmlc-core submodule
* Add a test for reading qid
* Add contributor
* Check the size of qids_
* Document qid format
* allow arbitrary cross validation fold indices
- use training indices passed to `folds` parameter in `training.cv`
- update doc string
* add tests for arbitrary fold indices
* Refactor to allow for custom regularisation methods
* Implement compositional SplitEvaluator framework
* Fixed segfault when no monotone_constraints are supplied.
* Change pid to parentID
* test_monotone_constraints.py now passes
* Refactor ColMaker and DistColMaker to use SplitEvaluator
* Performance optimisation when no monotone_constraints specified
* Fix linter messages
* Fix a few more linter errors
* Update the amalgamation
* Add bounds check
* Add check for leaf node
* Fix linter error in param.h
* Fix clang-tidy errors on CI
* Fix incorrect function name
* Fix clang-tidy error in updater_fast_hist.cc
* Enable SSE2 for Win32 R MinGW
Addresses https://github.com/dmlc/xgboost/pull/3335#issuecomment-400535752
* Add contributor
CI tests were failing because wget prompts "the user" for a response
whenever the google test archive is already on the disk.
Fix: Use `-nc` option to skip download when the archive already
exists
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* maven central release
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* [jvm-packages] XGBoost Spark integration refactor. (#3313)
* XGBoost Spark integration refactor.
* Make corresponding update for xgboost4j-example
* Address comments.
* [jvm-packages] Refactor XGBoost-Spark params to make it compatible with both XGBoost and Spark MLLib (#3326)
* Refactor XGBoost-Spark params to make it compatible with both XGBoost and Spark MLLib
* Fix extra space.
* [jvm-packages] XGBoost Spark supports ranking with group data. (#3369)
* XGBoost Spark supports ranking with group data.
* Use Iterator.duplicate to prevent OOM.
* Update CheckpointManagerSuite.scala
* Resolve conflicts
* Use sparse page as singular CSR matrix representation
* Simplify dmatrix methods
* Reduce statefullness of batch iterators
* BREAKING CHANGE: Remove prob_buffer_row parameter. Users are instead recommended to sample their dataset as a preprocessing step before using XGBoost.
* GPU binning and compression.
- binning and index compression are done inside the DeviceShard constructor
- in case of a DMatrix with multiple row batches, it is first converted into a single row batch
Currently, `CLIPredict()` saves prediction results in default 6-digit precision which causes precision loss. This PR sets precision to a level so that the conversion back to `bst_float` is lossless.
Related: #3298.
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* update 0.80
* Fix print.xgb.Booster
valid_handle should be TRUE when x$handle is NOT null
* Update xgb.Booster.R
Modify is.null.handle to return TRUE for NULL handle
* Add option to use weights when evaluating metrics in validation sets
* Add test for validation-set weights functionality
* simplify case with no weights for test sets
* fix lint issues
* For CRAN submission, remove all #pragma's that suppress compiler warnings
A few headers in dmlc-core contain #pragma's that disable compiler warnings,
which is against the CRAN submission policy. Fix the problem by removing
the offending #pragma's as part of the command `make Rbuild`.
This addresses issue #3322.
* Fix script to improve Cygwin/MSYS compatibility
We need this to pass rmingw CI test
* Remove remove_warning_suppression_pragma.sh from packaged tarball
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* static glibc glibc++
* update to build with glib 2.12
* remove unsupported flags
* update version number
* remove properties
* remove unnecessary command
* update poms
* Update dmlc-core submodule
* Fix dense_parser to work with the latest dmlc-core
* Specify location of Google Test
* Add more source files in dmlc-minimum to get latest dmlc-core working
* Update dmlc-core submodule
* Adjust xgboost entries in .gitignore
They were overly broad. In particularly this was inconvenient when
working with tools such as fzf that use the .gitignore to decide what to
include. As written, we'd not look into /include/xgboost.
* Make cosmetic improvements to .gitignore
* Remove dmlc-core from .gitignore
This seems unnecessary and has the drawback that tools that use
.gitignore to know files to skip mean they won't look here, and being
able to inspect the submodule files with them is useful.
* Increase precision of bst_float values in tree dumps
* Increase precision of bst_float values in tree dumps
* Fix lint error and switch precision to right float variable
* Fix clang-tidy error
* Multi-GPU HostDeviceVector.
- HostDeviceVector instances can now span multiple devices, defined by GPUSet struct
- the interface of HostDeviceVector has been modified accordingly
- GPU objective functions are now multi-GPU
- GPU predicting from cache is now multi-GPU
- avoiding omp_set_num_threads() calls
- other minor changes
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* change version of jvm to keep consistent with other pkgs
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* update default spark version to 2.3
Without this, with gcc 7.3.0, we see things like:
/xgboost/include/xgboost/c_api.h:98:1: error: function
declaration isn't a prototype [-Werror=strict-prototypes]
XGB_DLL const char *XGBGetLastError();
^~~~~~~
In some cases, users may not want to have any global replica of
the data being broadcasted/all-reduced. In such cases, set the
result_buffer_round to -1 as a flag that this is not necessary
and check for it.
Now we are passing the folder where the round instances are saved.
The problem is that calling utils::Check or utils::Assert on 1 or 2 nodes, shutdowns all of them. Only those should be shutdown and this will work. There maybe some other mechanism to shutdown a particular node. Tianqi?
updating mock. It now wraps the calls to sync and reads config from configuration file.
I believe it's better not to use the preprocessor directive, i.e. not to put any test code in the engine_tcp. I just call the mock in the test_allreduce file. It's a file purely for testing purposes, so it's fine to use the mock there.
It shouldn't be an assert because it shutdowns the process. Instead should check on the value and return some sort of error, so that we can recover.
The mock contains queues, indexed by the rank of the process. For each node, you can configure the behavior you expect (success or failure for now) when you call any of the methods (AllReduce, Broadcast, LoadCheckPoint and CheckPoint)... If you call several times AllReduce, the outputs will pop from the queue, i.e., first you can retrieve a success, then a failure and so on.
Pretty basic for now, need to tune it better
Thanks for participating in the XGBoost community! We use https://discuss.xgboost.ai for any general usage questions and discussions. The issue tracker is used for actionable items such as feature proposals discussion, roadmaps, and bug tracking. You are always welcomed to post on the forum first :)
Issues that are inactive for a period of time may get closed. We adopt this policy so that we won't lose track of actionable issues that may fall at the bottom of the pile. Feel free to reopen a new one if you feel there is an additional problem that needs attention when an old one gets closed.
For bug reports, to help the developer act on the issues, please include a description of your environment, preferably a minimum script to reproduce the problem.
For feature proposals, list clear, small actionable items so we can track the progress of the change.
XGBoost has been developed and used by a group of active community. Everyone is more than welcomed to is a great way to make the project better and more accessible to more users.
Project Management Committee(PMC)
----------
The Project Management Committee(PMC) consists group of active committers that moderate the discussion, manage the project release, and proposes new committer/PMC members.
* [Tianqi Chen](https://github.com/tqchen), University of Washington
- Tianqi is a Ph.D. student working on large-scale machine learning. He is the creator of the project.
- Yuan is a founding engineer at Akuity. He contributed mostly in R and Python packages.
* [Nan Zhu](https://github.com/CodingCat), Uber
- Nan is a software engineer in Uber. He contributed mostly in JVM packages.
* [Jiaming Yuan](https://github.com/trivialfis)
- Jiaming contributed to the GPU algorithms. He has also introduced new abstractions to improve the quality of the C++ codebase.
* [Hyunsu Cho](http://hyunsu-cho.io/), NVIDIA
- Hyunsu is the maintainer of the XGBoost Python package. He also manages the Jenkins continuous integration system (https://xgboost-ci.net/). He is the initial author of the CPU 'hist' updater.
* [Rory Mitchell](https://github.com/RAMitchell), University of Waikato
- Rory is a Ph.D. student at University of Waikato. He is the original creator of the GPU training algorithms. He improved the CMake build system and continuous integration.
* [Hongliang Liu](https://github.com/phunterlau)
Committers
----------
Committers are people who have made substantial contribution to the project and granted write access to the project.
* [Tianqi Chen](https://github.com/tqchen), University of Washington
- Tianqi is a PhD working on large-scale machine learning, he is the creator of the project.
* [Tong He](https://github.com/hetong007), Simon Fraser University
- Tong is a master student working on data mining, he is the maintainer of xgboost R package.
* [Tong He](https://github.com/hetong007), Amazon AI
- Tong is an applied scientist in Amazon AI. He is the maintainer of XGBoost R package.
-Sergei is a software engineer in Criteo. He contributed mostly in JVM packages.
* [Scott Lundberg](http://scottlundberg.com/), University of Washington
-Scott is a Ph.D. student at University of Washington. He is the creator of SHAP, a unified approach to explain the output of machine learning models such as decision tree ensembles. He also helps maintain the XGBoost Julia package.
#' \code{list(metric='metric-name', value='metric-value')} with given
#' prediction and dtrain.
#' @param stratified a \code{boolean} indicating whether sampling of folds should be stratified
#' @param stratified a \code{boolean} indicating whether sampling of folds should be stratified
#' by the values of outcome labels.
#' @param folds \code{list} provides a possibility to use a list of pre-defined CV folds
#' (each element must be a vector of test fold's indices). When folds are supplied,
#' (each element must be a vector of test fold's indices). When folds are supplied,
#' the \code{nfold} and \code{stratified} parameters are ignored.
#' @param train_folds \code{list} list specifying which indicies to use for training. If \code{NULL}
#' (the default) all indices not specified in \code{folds} will be used for training.
#' @param verbose \code{boolean}, print the statistics during the process
#' @param print_every_n Print each n-th iteration evaluation messages when \code{verbose>0}.
#' Default is 1 which means all messages are printed. This parameter is passed to the
#' Default is 1 which means all messages are printed. This parameter is passed to the
#' \code{\link{cb.print.evaluation}} callback.
#' @param early_stopping_rounds If \code{NULL}, the early stopping function is not triggered.
#' If set to an integer \code{k}, training with a validation set will stop if the performance
#' @param early_stopping_rounds If \code{NULL}, the early stopping function is not triggered.
#' If set to an integer \code{k}, training with a validation set will stop if the performance
#' doesn't improve for \code{k} rounds.
#' Setting this parameter engages the \code{\link{cb.early.stop}} callback.
#' @param maximize If \code{feval} and \code{early_stopping_rounds} are set,
@@ -60,100 +67,107 @@
#' When it is \code{TRUE}, it means the larger the evaluation score the better.
#' This parameter is passed to the \code{\link{cb.early.stop}} callback.
#' @param callbacks a list of callback functions to perform various task during boosting.
#' See \code{\link{callbacks}}. Some of the callbacks are automatically created depending on the
#' parameters' values. User can provide either existing or their own callback methods in order
#' See \code{\link{callbacks}}. Some of the callbacks are automatically created depending on the
#' parameters' values. User can provide either existing or their own callback methods in order
#' to customize the training process.
#' @param ... other parameters to pass to \code{params}.
#'
#' @details
#' The original sample is randomly partitioned into \code{nfold} equal size subsamples.
#'
#' Of the \code{nfold} subsamples, a single subsample is retained as the validation data for testing the model, and the remaining \code{nfold - 1} subsamples are used as training data.
#'
#' The cross-validation process is then repeated \code{nrounds} times, with each of the \code{nfold} subsamples used exactly once as the validation data.
#'
#' All observations are used for both training and validation.
#'
#' Adapted from \url{http://en.wikipedia.org/wiki/Cross-validation_\%28statistics\%29#k-fold_cross-validation}
#'
#' @return
#' @details
#' The original sample is randomly partitioned into \code{nfold} equal size subsamples.
#'
#' Of the \code{nfold} subsamples, a single subsample is retained as the validation data for testing the model,
#' and the remaining \code{nfold - 1} subsamples are used as training data.
#'
#' The cross-validation process is then repeated \code{nrounds} times, with each of the
#' \code{nfold} subsamples used exactly once as the validation data.
#'
#' All observations are used for both training and validation.
#'
#' Adapted from \url{https://en.wikipedia.org/wiki/Cross-validation_\%28statistics\%29}
#'
#' @return
#' An object of class \code{xgb.cv.synchronous} with the following elements:
#' \itemize{
#' \item \code{call} a function call.
#' \item \code{params} parameters that were passed to the xgboost library. Note that it does not
#' \item \code{params} parameters that were passed to the xgboost library. Note that it does not
#' capture parameters changed by the \code{\link{cb.reset.parameters}} callback.
#' \item \code{callbacks} callback functions that were either automatically assigned or
#' explicitely passed.
#' \item \code{evaluation_log} evaluation history storead as a \code{data.table} with the
#' first column corresponding to iteration number and the rest corresponding to the
#' \item \code{callbacks} callback functions that were either automatically assigned or
#' explicitly passed.
#' \item \code{evaluation_log} evaluation history stored as a \code{data.table} with the
#' first column corresponding to iteration number and the rest corresponding to the
#' CV-based evaluation means and standard deviations for the training and test CV-sets.
#' It is created by the \code{\link{cb.evaluation.log}} callback.
#' \item \code{niter} number of boosting iterations.
#' \item \code{nfeatures} number of features in training data.
#' \item \code{folds} the list of CV folds' indices - either those passed through the \code{folds}
#' \item \code{folds} the list of CV folds' indices - either those passed through the \code{folds}
#' parameter or randomly generated.
#' \item \code{best_iteration} iteration number with the best evaluation metric value
#' (only available with early stopping).
#' \item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the bestiteration,
#' which could further be used in \code{predict} method
#' (only available with early stopping).
#' \item \code{pred} CV prediction values available when \code{prediction} is set.
#' \item \code{best_ntreelimit} and the \code{ntreelimit} Deprecated attributes, use \code{best_iteration} instead.
#' \item \code{pred} CV prediction values available when \code{prediction} is set.
#' It is either vector or matrix (see \code{\link{cb.cv.predict}}).
#' \item \code{models} a liost of the CV folds' models. It is only available with the explicit
#' \item \code{models} a list of the CV folds' models. It is only available with the explicit
#' setting of the \code{cb.cv.predict(save_models = TRUE)} callback.
#' \code{xgb.train} is an advanced interface for training an xgboost model.
#' The \code{xgboost} function is a simpler wrapper for \code{xgb.train}.
#'
#' @param params the list of parameters.
#' The complete list of parameters is available at \url{http://xgboost.readthedocs.io/en/latest/parameter.html}.
#' Below is a shorter summary:
#'
#' @param params the list of parameters. The complete list of parameters is
#' available in the \href{http://xgboost.readthedocs.io/en/latest/parameter.html}{online documentation}. Below
#' is a shorter summary:
#'
#' 1. General Parameters
#'
#'
#' \itemize{
#' \item \code{booster} which booster to use, can be \code{gbtree} or \code{gblinear}. Default: \code{gbtree}.
#' }
#'
#'
#' 2. Booster Parameters
#'
#' 2.1. Parameter for Tree Booster
#'
#'
#' 2.1. Parameters for Tree Booster
#'
#' \itemize{
#' \item \code{eta} control the learning rate: scale the contribution of each tree by a factor of \code{0 < eta < 1} when it is added to the current approximation. Used to prevent overfitting by making the boosting process more conservative. Lower value for \code{eta} implies larger value for \code{nrounds}: low \code{eta} value means model more robust to overfitting but slower to compute. Default: 0.3
#' \item \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree. the larger, the more conservative the algorithm will be.
#' \item{ \code{eta} control the learning rate: scale the contribution of each tree by a factor of \code{0 < eta < 1}
#' when it is added to the current approximation.
#' Used to prevent overfitting by making the boosting process more conservative.
#' Lower value for \code{eta} implies larger value for \code{nrounds}: low \code{eta} value means model
#' more robust to overfitting but slower to compute. Default: 0.3}
#' \item{ \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree.
#' the larger, the more conservative the algorithm will be.}
#' \item \code{max_depth} maximum depth of a tree. Default: 6
#' \item\code{min_child_weight} minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
#' \item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. It makes computation shorter (because less data to analyse). It is advised to use this parameter with \code{eta} and increase \code{nround}. Default: 1
#' \item{\code{min_child_weight} minimum sum of instance weight (hessian) needed in a child.
#' If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight,
#' then the building process will give up further partitioning.
#' In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node.
#' The larger, the more conservative the algorithm will be. Default: 1}
#' \item{ \code{subsample} subsample ratio of the training instance.
#' Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees
#' and this will prevent overfitting. It makes computation shorter (because less data to analyse).
#' It is advised to use this parameter with \code{eta} and increase \code{nrounds}. Default: 1}
#' \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
#' \item \code{num_parallel_tree} Experimental parameter. number of trees to grow per round. Useful to test Random Forest through Xgboost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{round = 1}) accordingly. Default: 1
#' \item \code{monotone_constraints} A numerical vector consists of \code{1}, \code{0} and \code{-1} with its length equals to the number of features in the training data. \code{1} is increasing, \code{-1} is decreasing and \code{0} is no constraint.
#' \item \code{lambda} L2 regularization term on weights. Default: 1
#' \item \code{alpha} L1 regularization term on weights. (there is no L1 reg on bias because it is not important). Default: 0
#' \item{ \code{num_parallel_tree} Experimental parameter. number of trees to grow per round.
#' \item{ \code{monotone_constraints} A numerical vector consists of \code{1}, \code{0} and \code{-1} with its length
#' equals to the number of features in the training data.
#' \code{1} is increasing, \code{-1} is decreasing and \code{0} is no constraint.}
#' \item{ \code{interaction_constraints} A list of vectors specifying feature indices of permitted interactions.
#' Each item of the list represents one permitted interaction where specified features are allowed to interact with each other.
#' Feature index values should start from \code{0} (\code{0} references the first column).
#' Leave argument unspecified for no interaction constraints.}
#' }
#'
#' 2.2. Parameter for Linear Booster
#'
#'
#' 2.2. Parameters for Linear Booster
#'
#' \itemize{
#' \item \code{lambda} L2 regularization term on weights. Default: 0
#' \item \code{lambda_bias} L2 regularization term on bias. Default: 0
#' \item \code{alpha} L1 regularization term on weights. (there is no L1 reg on bias because it is not important). Default: 0
#' }
#'
#' 3. Task Parameters
#'
#'
#' 3. Task Parameters
#'
#' \itemize{
#' \item \code{objective} specify the learning task and the corresponding learning objective, users can pass a self-defined function to it. The default objective options are below:
#' \item{ \code{objective} specify the learning task and the corresponding learning objective, users can pass a self-defined function to it.
#' The default objective options are below:
#' \itemize{
#' \item \code{reg:linear} linear regression (Default).
#' \item \code{reg:squarederror} Regression with squared loss (Default).
#' \item{ \code{reg:squaredlogerror}: regression with squared log loss \eqn{1/2 * (log(pred + 1) - log(label + 1))^2}.
#' All inputs are required to be greater than -1.
#' Also, see metric rmsle for possible issue with this objective.}
#' \item \code{reg:logistic} logistic regression.
#' \item \code{reg:pseudohubererror}: regression with Pseudo Huber loss, a twice differentiable alternative to absolute loss.
#' \item \code{binary:logistic} logistic regression for binary classification. Output probability.
#' \item \code{binary:logitraw} logistic regression for binary classification, output score before logistic transformation.
#' \item \code{num_class} set the number of classes. To use only with multiclass objectives.
#' \item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective. Class is represented by a number and should be from 0 to \code{num_class - 1}.
#' \item \code{multi:softprob} same as softmax, but prediction outputs a vector of ndata * nclass elements, which can be further reshaped to ndata, nclass matrix. The result contains predicted probabilities of each data point belonging to each class.
#' \item \code{binary:hinge}: hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.
#' \item{ \code{count:poisson}: Poisson regression for count data, output mean of Poisson distribution.
#' \code{max_delta_step} is set to 0.7 by default in poisson regression (used to safeguard optimization).}
#' \item{ \code{survival:cox}: Cox regression for right censored survival time data (negative values are considered right censored).
#' Note that predictions are returned on the hazard ratio scale (i.e., as HR = exp(marginal_prediction) in the proportional
#' hazard function \code{h(t) = h0(t) * HR)}.}
#' \item{ \code{survival:aft}: Accelerated failure time model for censored survival time data. See
#' \href{https://xgboost.readthedocs.io/en/latest/tutorials/aft_survival_analysis.html}{Survival Analysis with Accelerated Failure Time}
#' for details.}
#' \item \code{aft_loss_distribution}: Probability Density Function used by \code{survival:aft} and \code{aft-nloglik} metric.
#' \item{ \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective.
#' Class is represented by a number and should be from 0 to \code{num_class - 1}.}
#' \item{ \code{multi:softprob} same as softmax, but prediction outputs a vector of ndata * nclass elements, which can be
#' further reshaped to ndata, nclass matrix. The result contains predicted probabilities of each data point belonging
#' to each class.}
#' \item \code{rank:pairwise} set xgboost to do ranking task by minimizing the pairwise loss.
#' \item{ \code{rank:ndcg}: Use LambdaMART to perform list-wise ranking where
#' \href{https://en.wikipedia.org/wiki/Discounted_cumulative_gain}{Normalized Discounted Cumulative Gain (NDCG)} is maximized.}
#' \item{ \code{rank:map}: Use LambdaMART to perform list-wise ranking where
#' \href{https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision}{Mean Average Precision (MAP)}
#' is maximized.}
#' \item{ \code{reg:gamma}: gamma regression with log-link.
#' Output is a mean of gamma distribution.
#' It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be
#' \item \code{base_score} the initial prediction score of all instances, global bias. Default: 0.5
#' \item \code{eval_metric} evaluation metrics for validation data. Users can pass a self-defined function to it. Default: metric will be assigned according to objective(rmse for regression, and error for classification, mean average precision for ranking). List is provided in detail section.
#' \item{ \code{eval_metric} evaluation metrics for validation data.
#' Users can pass a self-defined function to it.
#' Default: metric will be assigned according to objective
#' (rmse for regression, and error for classification, mean average precision for ranking).
#' List is provided in detail section.}
#' }
#'
#'
#' @param data training dataset. \code{xgb.train} accepts only an \code{xgb.DMatrix} as the input.
#' \code{xgboost}, in addition, also accepts \code{matrix}, \code{dgCMatrix}, or name of a local data file.
#' @param nrounds max number of boosting iterations.
#' @param watchlist named list of xgb.DMatrix datasets to use for evaluating model performance.
#' Metrics specified in either \code{eval_metric} or \code{feval} will be computed for each
#' of these datasets during each boosting iteration, and stored in the end as a field named
#' \code{evaluation_log} in the resulting object. When either \code{verbose>=1} or
#' of these datasets during each boosting iteration, and stored in the end as a field named
#' \code{evaluation_log} in the resulting object. When either \code{verbose>=1} or
#' \code{\link{cb.print.evaluation}} callback is engaged, the performance results are continuously
#' printed out during the training.
#' printed out during the training.
#' E.g., specifying \code{watchlist=list(validation1=mat1, validation2=mat2)} allows to track
#' the performance of each round's model on mat1 and mat2.
#' @param obj customized objective function. Returns gradient and second order
#' @param obj customized objective function. Returns gradient and second order
#' \item \code{error} Binary classification error rate. It is calculated as \code{(# wrong cases) / (# all cases)}.
#' By default, it uses the 0.5 threshold for predicted values to define negative and positive instances.
#' Different threshold (e.g., 0.) could be specified as "error@0."
#' \item \code{merror} Multiclass classification error rate. It is calculated as \code{(# wrong cases) / (# all cases)}.
#' \item \code{auc} Area under the curve. \url{http://en.wikipedia.org/wiki/Receiver_operating_characteristic#'Area_under_curve} for ranking evaluation.
#' \item \code{mae} Mean absolute error
#' \item \code{mape} Mean absolute percentage error
#' \item{ \code{auc} Area under the curve.
#' \url{https://en.wikipedia.org/wiki/Receiver_operating_characteristic#'Area_under_curve} for ranking evaluation.}
#' \item \code{aucpr} Area under the PR curve. \url{https://en.wikipedia.org/wiki/Precision_and_recall} for ranking evaluation.
# Create a copy of the dataset with data.table package (data.table is 100% compliant with R dataframe but its syntax is a lot more consistent and its performance are really good).
df<-data.table(Arthritis,keep.rownames=F)
# Create a copy of the dataset with data.table package
# (data.table is 100% compliant with R dataframe but its syntax is a lot more consistent
# and its performance are really good).
df<-data.table(Arthritis,keep.rownames=FALSE)
# Let's add some new categorical features to see if it helps. Of course these feature are highly correlated to the Age feature. Usually it's not a good thing in ML, but Tree algorithms (including boosted trees) are able to select the best features, even in case of highly correlated features.
# For the first feature we create groups of age by rounding the real age. Note that we transform it to factor (categorical data) so the algorithm treat them as independant values.
df[,AgeDiscret:=as.factor(round(Age/10,0))]
# Let's add some new categorical features to see if it helps.
# Of course these feature are highly correlated to the Age feature.
# Usually it's not a good thing in ML, but Tree algorithms (including boosted trees) are able to select the best features,
# even in case of highly correlated features.
# For the first feature we create groups of age by rounding the real age.
# Note that we transform it to factor (categorical data) so the algorithm treat them as independant values.
df[,AgeDiscret:=as.factor(round(Age/10,0))]
# Here is an even stronger simplification of the real age with an arbitrary split at 30 years old. I choose this value based on nothing. We will see later if simplifying the information based on arbitrary values is a good strategy (I am sure you already have an idea of how well it will work!).
install.packages('vcd')#Available in Cran. Used for its dataset with categorical values.
install.packages('vcd')#Available in CRAN. Used for its dataset with categorical values.
require(vcd)
}
# According to its documentation, Xgboost works only on numbers.
# Sometimes the dataset we have to work on have categorical data.
# A categorical variable is one which have a fixed number of values. By example, if for each observation a variable called "Colour" can have only "red", "blue" or "green" as value, it is a categorical variable.
# According to its documentation, XGBoost works only on numbers.
# Sometimes the dataset we have to work on have categorical data.
# A categorical variable is one which have a fixed number of values.
# By example, if for each observation a variable called "Colour" can have only
# "red", "blue" or "green" as value, it is a categorical variable.
#
# In R, categorical variable is called Factor.
# In R, categorical variable is called Factor.
# Type ?factor in console for more information.
#
# In this demo we will see how to transform a dense dataframe with categorical variables to a sparse matrix before analyzing it in Xgboost.
# In this demo we will see how to transform a dense dataframe with categorical variables to a sparse matrix
# before analyzing it in XGBoost.
# The method we are going to see is usually called "one hot encoding".
#load Arthritis dataset in memory.
data(Arthritis)
# create a copy of the dataset with data.table package (data.table is 100% compliant with R dataframe but its syntax is a lot more consistent and its performance are really good).
df<-data.table(Arthritis,keep.rownames=F)
# create a copy of the dataset with data.table package
# (data.table is 100% compliant with R dataframe but its syntax is a lot more consistent
# and its performance are really good).
df<-data.table(Arthritis,keep.rownames=FALSE)
# Let's have a look to the data.table
cat("Print the dataset\n")
print(df)
# 2 columns have factor type, one has ordinal type (ordinal variable is a categorical variable with values wich can be ordered, here: None > Some > Marked).
# 2 columns have factor type, one has ordinal type
# (ordinal variable is a categorical variable with values which can be ordered, here: None > Some > Marked).
cat("Structure of the dataset\n")
str(df)
# Let's add some new categorical features to see if it helps. Of course these feature are highly correlated to the Age feature. Usually it's not a good thing in ML, but Tree algorithms (including boosted trees) are able to select the best features, even in case of highly correlated features.
# Let's add some new categorical features to see if it helps.
# Of course these feature are highly correlated to the Age feature.
# Usually it's not a good thing in ML, but Tree algorithms (including boosted trees) are able to select the best features,
# even in case of highly correlated features.
# For the first feature we create groups of age by rounding the real age. Note that we transform it to factor (categorical data) so the algorithm treat them as independant values.
df[,AgeDiscret:=as.factor(round(Age/10,0))]
# For the first feature we create groups of age by rounding the real age.
# Note that we transform it to factor (categorical data) so the algorithm treat them as independent values.
df[,AgeDiscret:=as.factor(round(Age/10,0))]
# Here is an even stronger simplification of the real age with an arbitrary split at 30 years old. I choose this value based on nothing. We will see later if simplifying the information based on arbitrary values is a good strategy (I am sure you already have an idea of how well it will work!).
# We remove ID as there is nothing to learn from this feature (it will just add some noise as the dataset is small).
df[,ID:=NULL]
df[,ID:=NULL]
# List the different values for the column Treatment: Placebo, Treated.
cat("Values of the categorical feature Treatment\n")
print(levels(df[,Treatment]))
print(levels(df[,Treatment]))
# Next step, we will transform the categorical data to dummy variables.
# This method is also called one hot encoding.
# The purpose is to transform each value of each categorical feature in one binary feature.
#
# Let's take, the column Treatment will be replaced by two columns, Placebo, and Treated. Each of them will be binary. For example an observation which had the value Placebo in column Treatment before the transformation will have, after the transformation, the value 1 in the new column Placebo and the value 0 in the new column Treated.
# Let's take, the column Treatment will be replaced by two columns, Placebo, and Treated.
# Each of them will be binary.
# For example an observation which had the value Placebo in column Treatment before the transformation will have, after the transformation,
# the value 1 in the new column Placebo and the value 0 in the new column Treated.
#
# Formulae Improved~.-1 used below means transform all categorical features but column Improved to binary values.
# Column Improved is excluded because it will be our output column, the one we want to predict.
# According to the matrix below, the most important feature in this dataset to predict if the treatment will work is the Age. The second most important feature is having received a placebo or not. The sex is third. Then we see our generated features (AgeDiscret). We can see that their contribution is very low (Gain column).
# According to the matrix below, the most important feature in this dataset to predict if the treatment will work is the Age.
# The second most important feature is having received a placebo or not.
# The sex is third.
# Then we see our generated features (AgeDiscret). We can see that their contribution is very low (Gain column).
# Does these result make sense?
# Let's check some Chi2 between each of these features and the outcome.
# Our first simplification of Age gives a Pearson correlation of 8.
print(chisq.test(df$AgeCat,df$Y))
# The perfectly random split I did between young and old at 30 years old have a low correlation of 2. It's a result we may expect as may be in my mind > 30 years is being old (I am 32 and starting feeling old, this may explain that), but for the illness we are studying, the age to be vulnerable is not the same. Don't let your "gut" lower the quality of your model. In "data science", there is science :-)
# The perfectly random split I did between young and old at 30 years old have a low correlation of 2.
# It's a result we may expect as may be in my mind > 30 years is being old (I am 32 and starting feeling old, this may explain that),
# but for the illness we are studying, the age to be vulnerable is not the same.
# Don't let your "gut" lower the quality of your model. In "data science", there is science :-)
# As you can see, in general destroying information by simplifying it won't improve your model. Chi2 just demonstrates that. But in more complex cases, creating a new feature based on existing one which makes link with the outcome more obvious may help the algorithm and improve the model. The case studied here is not enough complex to show that. Check Kaggle forum for some challenging datasets.
# As you can see, in general destroying information by simplifying it won't improve your model.
# Chi2 just demonstrates that.
# But in more complex cases, creating a new feature based on existing one which makes link with the outcome
# more obvious may help the algorithm and improve the model.
# The case studied here is not enough complex to show that. Check Kaggle forum for some challenging datasets.
# However it's almost always worse when you add some arbitrary rules.
# Moreover, you can notice that even if we have added some not useful new features highly correlated with other features, the boosting tree algorithm have been able to choose the best one, which in this case is the Age. Linear model may not be that strong in these scenario.
# Moreover, you can notice that even if we have added some not useful new features highly correlated with
# other features, the boosting tree algorithm have been able to choose the best one, which in this case is the Age.
# Linear model may not be that strong in these scenario.
if (file.exists('xgb.DMatrix.data')) file.remove('xgb.DMatrix.data')
}
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.