This PR introduces a new parameter `extmem_concat_pages` to make the page concatenation optional for GPU hist. In addition, the document is updated for the new GPU-based external memory.
- Expose the maximum number of cached nodes to be consistent with the CPU implementation. Also easier for testing.
- Extract the subtraction trick for easier testing.
- Split up the `GradientQuantiser` to avoid circular dependency.
- Expose `NumBatches` in `DMatrix`.
- Small cleanup for removing legacy CUDA stream and ~force CUDA context initialization~.
- Purge old external memory data generation code.
- Use `UpdatePosition` for all nodes and skip `FinalizePosition` when external memory is used.
- Create `encode/decode` for node position, this is just as a refactor.
- Reuse code between update position and finalization.
* Cleanup GPU Hist tests.
- Remove GPU Hist gradient sampling test. The same properties are tested in the gradient
sampler test suite.
- Move basic histogram tests into the histogram test suite.
- Remove the header inclusion of the `updater_gpu_hist.cu` in tests.
- Support resource view in ellpack.
- Define the CUDA version of MMAP resource.
- Define the CUDA version of malloc resource.
- Refactor cuda runtime API wrappers, and add memory access related wrappers.
- gather windows macros into a single header.
This PR replaces the original RABIT implementation with a new one, which has already been partially merged into XGBoost. The new one features:
- Federated learning for both CPU and GPU.
- NCCL.
- More data types.
- A unified interface for all the underlying implementations.
- Improved timeout handling for both tracker and workers.
- Exhausted tests with metrics (fixed a couple of bugs along the way).
- A reusable tracker for Python and JVM packages.
- Use std::uint64_t instead of size_t to avoid implementation-defined type.
- Rename to bst_idx_t, to account for other types of indexing.
- Small cleanup to the base header.
* [coll] Pass context to various functions.
In the future, the `Context` object would be required for collective operations, this PR
passes the context object to some required functions to prepare for swapping out the
implementation.
- Use the `linalg::Matrix` for storing gradients.
- New API for the custom objective.
- Custom objective for multi-class/multi-target is now required to return the correct shape.
- Custom objective for Python can accept arrays with any strides. (row-major, column-major)
- Pass context from booster to DMatrix.
- Use context instead of integer for `n_threads`.
- Check the consistency configuration for `max_bin`.
- Test for all combinations of initialization options.
- Fix prediction range.
- Support prediction cache in mt-hist.
- Support model slicing.
- Make the booster a Python iterable by defining `__iter__`.
- Cleanup removed/deprecated parameters.
- A new field in the output model `iteration_indptr` for pointing to the ranges of trees for each iteration.
- Pass obj info into tree updater as const pointer.
This way we don't have to initialize the learner model param before configuring gbm, hence
breaking up the dependency of configurations.
* Prepare for improving Windows networking compatibility.
* Include dmlc filesystem indirectly as dmlc/filesystem.h includes windows.h, which
conflicts with winsock2.h
* Define `NOMINMAX` conditionally.
* Link the winsock library when mysys32 is used.
* Add config file for read the doc.
- Remove unused parameters. There are still many warnings that are not yet
addressed. Currently, the warnings in dmlc-core dominate the error log.
- Remove `distributed` parameter from metric.
- Fixes some warnings about signed comparison.
Support adaptive tree, a feature supported by both sklearn and lightgbm. The tree leaf is recomputed based on residue of labels and predictions after construction.
For l1 error, the optimal value is the median (50 percentile).
This is marked as experimental support for the following reasons:
- The value is not well defined for distributed training, where we might have empty leaves for local workers. Right now I just use the original leaf value for computing the average with other workers, which might cause significant errors.
- Some follow-ups are required, for exact, pruner, and optimization for quantile function. Also, we need to calculate the initial estimation.
* Implement `MaxCategory` in quantile.
* Implement partition-based split for GPU evaluation. Currently, it's based on the existing evaluation function.
* Extract an evaluator from GPU Hist to store the needed states.
* Added some CUDA stream/event utilities.
* Update document with references.
* Fixed a bug in approx evaluator where the number of data points is less than the number of categories.
This PR rewrites the approx tree method to use codebase from hist for better performance and code sharing.
The rewrite has many benefits:
- Support for both `max_leaves` and `max_depth`.
- Support for `grow_policy`.
- Support for mono constraint.
- Support for feature weights.
- Support for easier bin configuration (`max_bin`).
- Support for categorical data.
- Faster performance for most of the datasets. (many times faster)
- Support for prediction cache.
- Significantly better performance for external memory.
- Unites the code base between approx and hist.
On GPU we use rouding factor to truncate the gradient for deterministic results. This PR changes the gradient representation to fixed point number with exponent aligned with rounding factor.
[breaking] Drop non-deterministic histogram.
Use fixed point for shared memory.
This PR is to improve the performance of GPU Hist.
Co-authored-by: Andy Adinets <aadinets@nvidia.com>
- Reduce dependency on dmlc parsers and provide an interface for users to load data by themselves.
- Remove use of threaded iterator and IO queue.
- Remove `page_size`.
- Make sure the number of pages in memory is bounded.
- Make sure the cache can not be violated.
- Provide an interface for internal algorithms to process data asynchronously.