160 Commits

Author SHA1 Message Date
Jiaming Yuan
deab0e32ba
Validate out of range categorical value. (#7576)
* Use float in CPU categorical set to preserve the input value.
* Check out of range values.
2022-01-18 20:16:19 +08:00
Jiaming Yuan
d6ea5cc1ed
Cover approx tree method for categorical data tests. (#7569)
* Add tree to df tests.
* Add plotting tests.
* Add histogram tests.
2022-01-16 11:31:40 +08:00
Jiaming Yuan
001503186c
Rewrite approx (#7214)
This PR rewrites the approx tree method to use codebase from hist for better performance and code sharing.

The rewrite has many benefits:
- Support for both `max_leaves` and `max_depth`.
- Support for `grow_policy`.
- Support for mono constraint.
- Support for feature weights.
- Support for easier bin configuration (`max_bin`).
- Support for categorical data.
- Faster performance for most of the datasets. (many times faster)
- Support for prediction cache.
- Significantly better performance for external memory.
- Unites the code base between approx and hist.
2022-01-10 21:15:05 +08:00
Jiaming Yuan
0df2ae63c7
Fix num_boosted_rounds for linear model. (#7538)
* Add note.

* Fix n boosted rounds.
2022-01-05 03:29:33 +08:00
Ginko Balboa
29bfa94bb6
Fix external memory with gpu_hist and subsampling combination bug. (#7481)
Instead of accessing data from the `original_page_`, access the data from the first page of the available batch.

fix #7476

Co-authored-by: jiamingy <jm.yuan@outlook.com>
2021-12-24 11:15:35 +08:00
Jiaming Yuan
58a6723eb1
Initial support for multioutput regression. (#7514)
* Add num target model parameter, which is configured from input labels.
* Change elementwise metric and indexing for weights.
* Add demo.
* Add tests.
2021-12-18 09:28:38 +08:00
Jiaming Yuan
70b12d898a
[dask] Fix ddqdm with empty partition. (#7510)
* Fix empty partition.

* war.
2021-12-16 20:37:29 +08:00
Jiaming Yuan
a55d43ccfd
Add test for invalid categorical data values. (#7380)
* Add test for invalid categorical data values.

* Add check during sketching.
2021-11-02 18:00:52 +08:00
Jiaming Yuan
a13321148a
Support multi-class with base margin. (#7381)
This is already partially supported but never properly tested. So the only possible way to use it is calling `numpy.ndarray.flatten` with `base_margin` before passing it into XGBoost. This PR adds proper support
for most of the data types along with tests.
2021-11-02 13:38:00 +08:00
Jiaming Yuan
3c4aa9b2ea
[breaking] Remove label encoder deprecated in 1.3. (#7357) 2021-10-28 13:24:29 +08:00
Jiaming Yuan
ac9bfaa4f2
Handle missing values in dataframe with category dtype. (#7331)
* Replace -1 in pandas initializer.
* Unify `IsValid` functor.
* Mimic pandas data handling in cuDF glue code.
* Check invalid categories.
* Fix DDM sketching.
2021-10-28 03:33:54 +08:00
Jiaming Yuan
d4349426d8
Re-implement PR-AUC. (#7297)
* Support binary/multi-class classification, ranking.
* Add documents.
* Handle missing data.
2021-10-26 13:07:50 +08:00
Jiaming Yuan
f999897615
[dask] Use nthread in DMatrix construction. (#7337)
This is consistent with the thread overriding behavior.
2021-10-20 15:16:40 +08:00
Jiaming Yuan
f56e2e9a66
Support categorical data with pandas Dataframe in inplace prediction (#7322) 2021-10-17 14:32:06 +08:00
Jiaming Yuan
5b17bb0031
Fix prediction with cat data in sklearn interface. (#7306)
* Specify DMatrix parameter for pre-processing dataframe.
* Add document about the behaviour of prediction.
2021-10-12 14:31:12 +08:00
Jiaming Yuan
298af6f409
Fix weighted samples in multi-class AUC. (#7300) 2021-10-11 15:12:29 +08:00
Jiaming Yuan
69d3b1b8b4
Remove old callback deprecated in 1.3. (#7280) 2021-10-08 17:24:59 +08:00
Jiaming Yuan
0ed979b096
Support more input types for categorical data. (#7220)
* Support more input types for categorical data.

* Shorten the type name from "categorical" to "c".
* Tests for np/cp array and scipy csr/csc/coo.
* Specify the type for feature info.
2021-09-16 20:39:30 +08:00
Jiaming Yuan
037dd0820d
Implement __sklearn_is_fitted__. (#7230) 2021-09-15 19:09:04 +08:00
Jiaming Yuan
d997c967d5
Demo for experimental categorical data support. (#7213) 2021-09-15 08:20:12 +08:00
Jiaming Yuan
3a4f51f39f
Avoid calling CUDA code on CPU for linear model. (#7154) 2021-09-01 10:45:31 +08:00
Jiaming Yuan
7a1d67f9cb
[breaking] Use integer atomic for GPU histogram. (#7180)
On GPU we use rouding factor to truncate the gradient for deterministic results. This PR changes the gradient representation to fixed point number with exponent aligned with rounding factor.

    [breaking] Drop non-deterministic histogram.
    Use fixed point for shared memory.

This PR is to improve the performance of GPU Hist. 

Co-authored-by: Andy Adinets <aadinets@nvidia.com>
2021-08-28 05:17:05 +08:00
Jiaming Yuan
7bdedacb54
Document for process_type. (#7135)
* Update document for prune and refresh.

* Add demo.
2021-08-03 13:11:52 +08:00
Jiaming Yuan
e88ac9cc54
[dask] Extend tree stats tests. (#7128)
* Add tests to GPU.
* Assert cover in children sums up to the parent.
2021-07-27 12:22:13 +08:00
Jiaming Yuan
e6088366df
Export Python Interface for external memory. (#7070)
* Add Python iterator interface.
* Add tests.
* Add demo.
* Add documents.
* Handle empty dataset.
2021-07-22 15:15:53 +08:00
Jiaming Yuan
a5d222fcdb
Handle categorical split in model histogram and dataframe. (#7065)
* Error on get_split_value_histogram when feature is categorical
* Add a category column to output dataframe
2021-07-02 13:10:36 +08:00
Jiaming Yuan
8fa32fdda2
Implement categorical data support for SHAP. (#7053)
* Add CPU implementation.
* Update GPUTreeSHAP.
* Add GPU implementation by defining custom split condition.
2021-06-25 19:02:46 +08:00
Jiaming Yuan
1d4d345634
Tests for dask skl categorical data support. (#7054) 2021-06-24 16:33:57 +08:00
Jiaming Yuan
29f8fd6fee
Support categorical split in tree model dump. (#7036) 2021-06-18 16:46:20 +08:00
Jiaming Yuan
86715e4cd4
Support categorical data for dask functional interface and DQM. (#7043)
* Support categorical data for dask functional interface and DQM.

* Implement categorical data support for GPU GK-merge.
* Add support for dask functional interface.
* Add support for DQM.

* Get newer cupy.
2021-06-18 13:06:52 +08:00
Jiaming Yuan
d9799b09d0
Categorical data support for cuDF. (#7042)
* Add support in DMatrix.
* Add support in DQM, except for iterator.
2021-06-17 13:54:33 +08:00
Jiaming Yuan
72f9daf9b6
Fix gpu_id with custom objective. (#7015) 2021-06-09 14:51:17 +08:00
Jiaming Yuan
c4b9f4f622
Add enable_categorical to sklearn. (#7011) 2021-06-04 02:29:14 +08:00
Jiaming Yuan
ee4f51a631
Support for all primitive types from array. (#7003)
* Change C API name.
* Test for all primitive types from array.
* Add native support for CPU 128 float.
* Convert boolean and float16 in Python.

* Fix dask version for now.
2021-06-01 08:34:48 +08:00
Jiaming Yuan
44cc9c04ea
Fix multiclass auc with empty dataset. (#6947) 2021-05-12 15:01:14 +08:00
Jiaming Yuan
05ac415780
[dask] Set dataframe index in predict. (#6944) 2021-05-12 13:24:21 +08:00
Jiaming Yuan
bec2b4f094
Revert "Use CPU input for test_boost_from_prediction. (#6818)" (#6858)
This reverts commit 74f3a2f4b5c2654af90d1477fd543b5d97280fbe.
2021-04-20 14:54:02 +08:00
Jiaming Yuan
dee5ef2dfd
Typehint for Sklearn. (#6799) 2021-04-14 06:55:21 +08:00
Jiaming Yuan
74f3a2f4b5
Use CPU input for test_boost_from_prediction. (#6818) 2021-04-02 00:11:35 +08:00
Jiaming Yuan
47b62480af
More general predict proba. (#6817)
* Use `output_margin` for `softmax`.
* Add test for dask binary cls.

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2021-04-01 19:52:12 +08:00
Jiaming Yuan
79b8b560d2
Optimize dart inplace predict perf. (#6804) 2021-03-31 15:20:54 +08:00
Jiaming Yuan
a59c7323b4
Fix inplace predict missing value. (#6787) 2021-03-27 05:36:10 +08:00
Jiaming Yuan
a7083d3c13
Fix dart inplace prediction with GPU input. (#6777)
* Fix dart inplace predict with data on GPU, which might trigger a fatal check
for device access right.
* Avoid copying data whenever possible.
2021-03-25 12:00:32 +08:00
Jiaming Yuan
bcc0277338
Re-implement ROC-AUC. (#6747)
* Re-implement ROC-AUC.

* Binary
* MultiClass
* LTR
* Add documents.

This PR resolves a few issues:
  - Define a value when the dataset is invalid, which can happen if there's an
  empty dataset, or when the dataset contains only positive or negative values.
  - Define ROC-AUC for multi-class classification.
  - Define weighted average value for distributed setting.
  - A correct implementation for learning to rank task.  Previous
  implementation is just binary classification with averaging across groups,
  which doesn't measure ordered learning to rank.
2021-03-20 16:52:40 +08:00
Jiaming Yuan
4ee8340e79
Support column major array. (#6765) 2021-03-20 05:19:46 +08:00
Jiaming Yuan
4f75f514ce
Fix GPU RF (#6755)
* Fix sampling.
2021-03-17 06:23:35 +08:00
Jiaming Yuan
325bc93e16
[dask] Use distributed.MultiLock (#6743)
* [dask] Use `distributed.MultiLock`

This enables training multiple models in parallel.

* Conditionally import `MultiLock`.
* Use async train directly in scikit learn interface.
* Use `worker_client` when available.
2021-03-16 14:19:41 +08:00
Jiaming Yuan
1335db6113
[dask] Improve documents. (#6687)
* Add tag for versions.
* use autoclass in sphinx build.
Made some class methods to be private to avoid exporting documents.
2021-02-09 09:20:58 +08:00
Jiaming Yuan
4656b09d5d
[breaking] Add prediction fucntion for DMatrix and use inplace predict for dask. (#6668)
* Add a new API function for predicting on `DMatrix`.  This function aligns
with rest of the `XGBoosterPredictFrom*` functions on semantic of function
arguments.
* Purge `ntree_limit` from libxgboost, use iteration instead.
* [dask] Use `inplace_predict` by default for dask sklearn models.
* [dask] Run prediction shape inference on worker instead of client.

The breaking change is in the Python sklearn `apply` function, I made it to be
consistent with other prediction functions where `best_iteration` is used by
default.
2021-02-08 18:26:32 +08:00
Jiaming Yuan
411592a347
Enhance inplace prediction. (#6653)
* Accept array interface for csr and array.
* Accept an optional proxy dmatrix for metainfo.

This constructs an explicit `_ProxyDMatrix` type in Python.

* Remove unused doc.
* Add strict output.
2021-02-02 11:41:46 +08:00