541 Commits

Author SHA1 Message Date
Jiaming Yuan
19b59938b7
Convert input to str for hypothesis note. (#9480) 2023-08-15 02:27:58 +08:00
Jiaming Yuan
05d7000096
Handle special characters in JSON model dump. (#9474) 2023-08-14 15:49:00 +08:00
Jiaming Yuan
801116c307
Test scikit-learn model IO with gblinear. (#9459) 2023-08-13 23:41:49 +08:00
Jiaming Yuan
f05a23b41c
Use weakref instead of id for DataIter cache. (#9445)
- Fix case where Python reuses id from freed objects.
- Small optimization to column matrix with QDM by using `realloc` instead of copying data.
2023-08-10 00:40:06 +08:00
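The id-reuse problem mentioned in the commit above can be illustrated with a small sketch. It is not the code from the commit: `DataIter` here is an empty stand-in class, and `WeakrefCache` is a hypothetical helper. The point is only that a weak reference goes dead when its target is freed, so a later object cannot be mistaken for the cached one even if CPython hands it the same `id`.

```python
import weakref


class DataIter:
    """Hypothetical stand-in for an iterator object used as a cache key."""


class WeakrefCache:
    """Hypothetical cache key holder that remembers an object by weak reference."""

    def __init__(self):
        self._ref = None

    def remember(self, obj):
        # Store a weak reference; it does not keep `obj` alive.
        self._ref = weakref.ref(obj)

    def is_cached(self, obj):
        # A dead reference returns None, which is never identical to `obj`,
        # so a recycled id cannot produce a false cache hit.
        return self._ref is not None and self._ref() is obj


cache = WeakrefCache()
first = DataIter()
cache.remember(first)
print(cache.is_cached(first))   # the live object is recognised

del first                       # the cached object is freed
second = DataIter()             # may or may not reuse the old id
print(cache.is_cached(second))  # the dead reference never matches
```

A cache keyed on `id(obj)` alone has no way to tell that the original object is gone, which is the failure mode the commit message describes.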
Philip Hyunsu Cho
7ce090e775
Handle UTF-8 paths correctly on Windows platform (#9443)
* Fix round-trip serialization with UTF-8 paths

* Add compiler version check

* Add comment to C API functions

* Add Python tests

* [CI] Update MacOS deployment target

* Use std::filesystem instead of dmlc::TemporaryDirectory
2023-08-07 23:27:25 -07:00
Jiaming Yuan
54029a59af
Bound the size of the histogram cache. (#9440)
- A new histogram collection with a limit in size.
- Unify histogram building logic between hist, multi-hist, and approx.
2023-08-08 03:21:26 +08:00
Jiaming Yuan
912e341d57
Initial GPU support for the approx tree method. (#9414) 2023-07-31 15:50:28 +08:00
Jiaming Yuan
851cba931e
Define best_iteration only if early stopping is used. (#9403)
* Define `best_iteration` only if early stopping is used.

This is the behavior specified by the document but not honored in the actual code.

- Don't set the attributes if there's no early stopping.
- Clean up the code for callbacks, and replace assertions with proper exceptions.
- Assign the attributes when early stopping `save_best` is used.
- Turn the attributes into Python properties.

---------

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2023-07-24 12:43:35 +08:00
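The behavior described in the commit above (an attribute that exists only when early stopping has set it, exposed as a Python property) might look roughly like the sketch below. `SketchModel` and its `_attrs` dictionary are hypothetical and are not the XGBoost implementation; the sketch only shows the property pattern.

```python
class SketchModel:
    """Hypothetical, simplified model object; not the real XGBoost class."""

    def __init__(self):
        # Attributes are stored as strings, assumed here for illustration.
        self._attrs = {}

    @property
    def best_iteration(self):
        # Raise a proper exception instead of returning a placeholder value
        # when early stopping never set the attribute.
        if "best_iteration" not in self._attrs:
            raise AttributeError(
                "`best_iteration` is only defined when early stopping is used."
            )
        return int(self._attrs["best_iteration"])

    @best_iteration.setter
    def best_iteration(self, value):
        self._attrs["best_iteration"] = str(int(value))


model = SketchModel()
print(hasattr(model, "best_iteration"))  # hasattr catches the AttributeError

model.best_iteration = 7                 # e.g. assigned by an early-stopping callback
print(model.best_iteration)
```

Raising `AttributeError` keeps `hasattr` and `getattr` with a default working the way callers expect for a missing attribute.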
Jiaming Yuan
01e00efc53
[breaking] Remove support for single string feature info. (#9401)
- Input must be a sequence of strings.
- Improve validation error message.
2023-07-24 11:06:30 +08:00
Jiaming Yuan
16eb41936d
Handle the new device parameter in dask and demos. (#9386)
* Handle the new `device` parameter in dask and demos.

- Check no ordinal is specified in the dask interface.
- Update demos.
- Update dask doc.
- Update the condition for QDM.
2023-07-15 19:11:20 +08:00
Jiaming Yuan
9da5050643
Turn warning messages into Python warnings. (#9387) 2023-07-15 07:46:43 +08:00
Jiaming Yuan
04aff3af8e
Define the new device parameter. (#9362) 2023-07-13 19:30:25 +08:00
Jiaming Yuan
97ed944209
Unify the hist tree method for different devices. (#9363) 2023-07-11 10:04:39 +08:00
Jiaming Yuan
20c52f07d2
Support exporting cut values (#9356) 2023-07-08 15:32:41 +08:00
Jiaming Yuan
e964654b8f
[skl] Enable cat feature without specifying tree method. (#9353) 2023-07-03 22:06:17 +08:00
Jiaming Yuan
39390cc2ee
[breaking] Remove the predictor param, allow fallback to prediction using DMatrix. (#9129)
- A `DeviceOrd` struct is implemented to indicate the device. It will eventually replace the `gpu_id` parameter.
- The `predictor` parameter is removed.
- Fallback to `DMatrix` when `inplace_predict` is not available.
- The heuristic for choosing a predictor is only used during training.
2023-07-03 19:23:54 +08:00
Jiaming Yuan
f4798718c7
Use hist as the default tree method. (#9320) 2023-06-27 23:04:24 +08:00
Jiaming Yuan
6d22ea793c
Test QDM with sparse data on CPU. (#9316) 2023-06-19 21:27:03 +08:00
Jiaming Yuan
ee6809e642
Use mmap for external memory. (#9282)
- Have basic infrastructure for mmap.
- Release file write handle.
2023-06-19 18:52:55 +08:00
Jiaming Yuan
1fcc26a6f8
Set ndcg to default for LTR. (#8822)
- Add document.
- Add tests.
- Use `ndcg` with `topk` as default.
2023-06-09 23:31:33 +08:00
Jiaming Yuan
9fbde21e9d
Rework the precision metric. (#9222)
- Rework the precision metric for both CPU and GPU.
- Mention it in the document.
- Cleanup old support code for GPU ranking metric.
- Deterministic GPU implementation.

* Drop support for classification.

* type.

* use batch shape.

* lint.

* cpu build.

* cpu build.

* lint.

* Tests.

* Fix.

* Cleanup error message.
2023-06-02 20:49:43 +08:00
Jiaming Yuan
3913ff470f
Import data lazily during tests. (#9176) 2023-05-23 03:58:31 +08:00
Jiaming Yuan
1f9a57d17b
[Breaking] Require format to be specified in input URI. (#9077)
Previously, we used `libsvm` as the default when the format was not specified. However, the
dmlc data parser is not particularly robust against errors, and the most common type of
error is an undefined format.

In addition, we recommend that users use other data loaders instead. We will continue to
maintain the parsers, as they are currently used for many internal tests, including
federated learning.
2023-04-28 19:45:15 +08:00
Jiaming Yuan
e206b899ef
Rework MAP and Pairwise for LTR. (#9075) 2023-04-28 02:39:12 +08:00
Jiaming Yuan
2c8d735cb3
Fix tests with pandas 2.0. (#9014)
* Fix tests with pandas 2.0.

- `is_categorical` is replaced by `is_categorical_dtype`.
- one hot encoding returns boolean type instead of integer type.
2023-04-11 00:17:34 +08:00
Jiaming Yuan
bac22734fb
Remove ntree limit in python package. (#8345)
- Remove `ntree_limit`. The parameter has been deprecated since 1.4.0.
- The SHAP package compatibility is broken.
2023-03-31 19:01:55 +08:00
Jiaming Yuan
acc110c251
[MT-TREE] Support prediction cache and model slicing. (#8968)
- Fix prediction range.
- Support prediction cache in mt-hist.
- Support model slicing.
- Make the booster a Python iterable by defining `__iter__`.
- Cleanup removed/deprecated parameters.
- A new field in the output model `iteration_indptr` for pointing to the ranges of trees for each iteration.
2023-03-27 23:10:54 +08:00
Jiaming Yuan
c2b3a13e70
[breaking][skl] Remove parameter serialization. (#8963)
- Remove parameter serialization in the scikit-learn interface.

The scikit-learn interface `save_model` will save only the model and discard all
hyper-parameters. This is to align with the native XGBoost interface, which distinguishes
between hyper-parameters and model parameters.

With the scikit-learn interface, model parameters are attributes of the estimator. For
instance, `n_features_in_` and `n_classes_` are always accessible with
`estimator.n_features_in_` and `estimator.n_classes_`, but not with
`estimator.get_params`.

- Define a `load_model` method for classifier to load its own attributes.

- Set n_estimators to None by default.
2023-03-27 21:34:10 +08:00
Jiaming Yuan
151882dd26
Initial support for multi-target tree. (#8616)
* Implement multi-target for hist.

- Add new hist tree builder.
- Move data fetchers for tests.
- Dispatch function calls in gbm based on the tree type.
2023-03-22 23:49:56 +08:00
Jiaming Yuan
5891f752c8
Rework the MAP metric. (#8931)
- The new implementation is stricter, as only binary labels are accepted. The previous implementation converted values greater than 1 to 1.
- Deterministic GPU. (no atomic add).
- Fix top-k handling.
- Precise definition of MAP. (There are other variants on how to handle top-k).
- Refactor GPU ranking tests.
2023-03-22 17:45:20 +08:00
Jiaming Yuan
f186c87cf9
Check inf in data for all types of DMatrix. (#8911) 2023-03-15 11:24:35 +08:00
Jiaming Yuan
910ce580c8
Clear all cache after model load. (#8904) 2023-03-14 22:09:36 +08:00
Jiaming Yuan
7eba285a1e
Support sklearn cross validation for ranker. (#8859)
* Support sklearn cross validation for ranker.

- Add a convention for X to include a special `qid` column.

sklearn utilities consider only `X`, `y` and `sample_weight` for supervised learning
algorithms, but we need an additional qid array for ranking.

It's important to be able to support the cross validation function in sklearn since all
other tuning functions like grid search are based on cross validation.
2023-03-07 00:22:08 +08:00
Jiaming Yuan
228a46e8ad
Support learning rate for zero-hessian objectives. (#8866) 2023-03-06 20:33:28 +08:00
Jiaming Yuan
6a892ce281
Specify src path for isort. (#8867) 2023-03-06 17:30:27 +08:00
Jiaming Yuan
cce4af4acf
Initial support for quantile loss. (#8750)
- Add support for Python.
- Add objective.
2023-02-16 02:30:18 +08:00
Jiaming Yuan
457f704e3d
Add quantile metric. (#8761) 2023-02-13 19:07:40 +08:00
Jiaming Yuan
225b3158f6
Support custom metric in sklearn ranker. (#8786) 2023-02-12 13:14:07 +08:00
Jiaming Yuan
8a16944664
Fix ranking with quantile dmatrix and group weight. (#8762) 2023-02-10 20:32:35 +08:00
Jiaming Yuan
199c421d60
Send default configuration from metric to objective. (#8760) 2023-02-09 20:18:07 +08:00
Jiaming Yuan
4ead65a28c
Increase timeout limit for linear. (#8767) 2023-02-09 18:20:12 +08:00
Jiaming Yuan
7b3d473593
[doc] Add demo for inference using individual tree. (#8752) 2023-02-07 04:40:18 +08:00
Jiaming Yuan
c1786849e3
Use array interface for CSC matrix. (#8672)
* Use array interface for CSC matrix.

Use array interface for CSC matrix and align the interface with CSR and dense.

- Fix nthread issue in the R package DMatrix.
- Unify the behavior of handling `missing` with other inputs.
- Unify the behavior of handling `missing` around R, Python, Java, and Scala DMatrix.
- Expose `num_non_missing` to the JVM interface.
- Deprecate old CSR and CSC constructors.
2023-02-05 01:59:46 +08:00
BenEfrati
213b5602d9
Add sample_weight to eval_metric (#8706) 2023-02-05 00:06:38 +08:00
Jiaming Yuan
0e61ba57d6
Fix GPU L1 error. (#8749) 2023-02-04 03:02:00 +08:00
Jiaming Yuan
1325ba9251
Support primitive types of pyarrow-backed pandas dataframe. (#8653)
Categorical data (dictionary) is not supported at the moment.
2023-01-30 17:53:29 +08:00
James Lamb
96e6b6beba
[ci] remove unused imports in tests (#8707) 2023-01-25 14:10:29 +08:00
Jiaming Yuan
31b9cbab3d
Make sure input numpy array is aligned. (#8690)
- use `np.require` to specify that the alignment is required.
- scipy csr as well.
- validate input pointer in `ArrayInterface`.
2023-01-18 08:12:13 +08:00
Jiaming Yuan
247946a875
Cache transformed in QuantileDMatrix for efficiency. (#8666) 2023-01-17 06:02:40 +08:00
Jiaming Yuan
d6018eb4b9
Remove all use of DeviceQuantileDMatrix. (#8665) 2023-01-17 00:04:10 +08:00