This PR prepares the GHistIndexMatrix to host the column matrix which is used by the hist tree method by accepting sparse_threshold parameter.
Some cleanups are made to ensure the correct batch param is being passed into DMatrix along with some additional tests for correctness of SimpleDMatrix.
* Add num target model parameter, which is configured from input labels.
* Change elementwise metric and indexing for weights.
* Add demo.
* Add tests.
* Add a new ctor to tensor for `initilizer_list`.
* Change labels from host device vector to tensor.
* Rename the field from `labels_` to `labels` since it's a public member.
This PR changes base_margin into a 3-dim array, with one of them being reserved for multi-target classification. Also, a breaking change is made for binary serialization due to extra dimension along with a fix for saving the feature weights. Lastly, it unifies the prediction initialization between CPU and GPU. After this PR, the meta info setter in Python will be based on array interface.
* Extend array interface to handle ndarray.
The `ArrayInterface` class is extended to support multi-dim array inputs. Previously this
class handles only 2-dim (vector is also matrix). This PR specifies the expected
dimension at compile-time and the array interface can perform various checks automatically
for input data. Also, adapters like CSR are more rigorous about their input. Lastly, row
vector and column vector are handled without intervention from the caller.
This is already partially supported but never properly tested. So the only possible way to use it is calling `numpy.ndarray.flatten` with `base_margin` before passing it into XGBoost. This PR adds proper support
for most of the data types along with tests.
* Add hessian to batch param in preparation of new approx impl.
* Extract a push method for gradient index matrix.
* Use span instead of vector ref for hessian in sketching.
* Create a binary format for gradient index.
- Reduce dependency on dmlc parsers and provide an interface for users to load data by themselves.
- Remove use of threaded iterator and IO queue.
- Remove `page_size`.
- Make sure the number of pages in memory is bounded.
- Make sure the cache can not be violated.
- Provide an interface for internal algorithms to process data asynchronously.
* Support categorical data for dask functional interface and DQM.
* Implement categorical data support for GPU GK-merge.
* Add support for dask functional interface.
* Add support for DQM.
* Get newer cupy.
* Initial support for distributed LTR using dask.
* Support `qid` in libxgboost.
* Refactor `predict` and `n_features_in_`, `best_[score/iteration/ntree_limit]`
to avoid duplicated code.
* Define `DaskXGBRanker`.
The dask ranker doesn't support group structure, instead it uses query id and
convert to group ptr internally.
* Make external memory data partitioning deterministic.
* Change the meaning of `page_size` from bytes to number of rows.
* Design a data pool.
* Note for external memory.
* Enable unity build on Windows CI.
* Force garbage collect on test.
* Add thread local return entry for DMatrix.
* Save feature name and feature type in binary file.
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* Use hypothesis
* Allow int64 array interface for groups
* Add packages to Windows CI
* Add to travis
* Make sure device index is set correctly
* Fix dask-cudf test
* appveyor
* Group aware GPU weighted sketching.
* Distribute group weights to each data point.
* Relax the test.
* Validate input meta info.
* Fix metainfo copy ctor.