253 Commits

Author SHA1 Message Date
Jiaming Yuan
c311a8c1d8
Enable compiling with system cub. (#7232)
- Tested with all CUDA 11.x.
- Workaround cub scan by using discard iterator in AUC.
- Limit the size of Argsort when compiled with CUDA cub.
2021-09-17 14:28:18 +08:00
Jiaming Yuan
0ed979b096
Support more input types for categorical data. (#7220)
* Support more input types for categorical data.

* Shorten the type name from "categorical" to "c".
* Tests for np/cp array and scipy csr/csc/coo.
* Specify the type for feature info.
2021-09-16 20:39:30 +08:00
Jiaming Yuan
3515931305
Initial support for external memory in gradient index. (#7183)
* Add hessian to batch param in preparation of new approx impl.
* Extract a push method for gradient index matrix.
* Use span instead of vector ref for hessian in sketching.
* Create a binary format for gradient index.
2021-09-13 12:40:56 +08:00
Philip Hyunsu Cho
336af4f974
Work around a segfault observed in SparsePage::Push() (#7161)
* Work around a segfault observed in SparsePage::Push()

* Revert "Work around a segfault observed in SparsePage::Push()"

This reverts commit 30934844d00908750a5442082eb4769b1489f6a9.

* Don't call vector::resize() inside OpenMP block

* Set GITHUB_PAT env var to fix R tests

* Use built-in GITHUB_TOKEN
2021-08-08 02:12:30 -07:00
Jiaming Yuan
8a84be37b8
Pass scikit learn estimator checks for regressor. (#7130)
* Check data shape.
* Check labels.
2021-08-03 18:58:20 +08:00
Robert Maynard
1a75f43304
Allow compilation with nvcc 11.4 (#7131)
* Use type aliases for discard iterators

* update to include host_vector as thrust 1.12 doesn't bring it in as a side-effect

* cub::DispatchRadixSort requires signed offset types
2021-07-27 20:05:33 +08:00
Jiaming Yuan
e6088366df
Export Python Interface for external memory. (#7070)
* Add Python iterator interface.
* Add tests.
* Add demo.
* Add documents.
* Handle empty dataset.
2021-07-22 15:15:53 +08:00
Jiaming Yuan
bd1f3a38f0
Rewrite sparse dmatrix using callbacks. (#7092)
- Reduce dependency on dmlc parsers and provide an interface for users to load data by themselves.
- Remove use of threaded iterator and IO queue.
- Remove `page_size`.
- Make sure the number of pages in memory is bounded.
- Make sure the cache can not be violated.
- Provide an interface for internal algorithms to process data asynchronously.
2021-07-16 12:33:31 +08:00
Jiaming Yuan
84d359efb8
Support host data in proxy DMatrix. (#7087) 2021-07-08 11:35:48 +08:00
Jiaming Yuan
c766f143ab
Refactor external memory formats. (#7089)
* Save base_rowid.
* Return write size.
* Remove unused function.
2021-07-08 04:04:51 +08:00
Jiaming Yuan
116d711815
Make SimpleDMatrix ctor reusable. (#7075) 2021-07-06 13:38:24 +08:00
Jiaming Yuan
1cd20efe68
Move GHistIndex into DMatrix. (#7064) 2021-07-01 00:44:49 +08:00
Jiaming Yuan
1c8fdf2218
Remove use of device_idx in dh::LaunchN. (#7063)
It's an unused parameter, removing it can make the CI log more readable.
2021-06-29 11:37:26 +08:00
Jiaming Yuan
7968c0d051
Test on s390x. (#7038)
* Fix && remove unused parameter.
2021-06-18 14:55:08 +08:00
Jiaming Yuan
86715e4cd4
Support categorical data for dask functional interface and DQM. (#7043)
* Support categorical data for dask functional interface and DQM.

* Implement categorical data support for GPU GK-merge.
* Add support for dask functional interface.
* Add support for DQM.

* Get newer cupy.
2021-06-18 13:06:52 +08:00
Jiaming Yuan
ee4f51a631
Support for all primitive types from array. (#7003)
* Change C API name.
* Test for all primitive types from array.
* Add native support for CPU 128 float.
* Convert boolean and float16 in Python.

* Fix dask version for now.
2021-06-01 08:34:48 +08:00
Jiaming Yuan
4cf95a6041
Support numpy array interface (#6998) 2021-05-27 16:08:22 +08:00
Andrew Ziem
3e7e426b36
Fix spelling in documents (#6948)
* Update roxygen2 doc.

Co-authored-by: fis <jm.yuan@outlook.com>
2021-05-11 20:44:36 +08:00
Jiaming Yuan
37ad60fe25
Enforce input data is not object. (#6927)
* Check for object data type.
* Allow strided arrays with greater underlying buffer size.
2021-05-02 00:09:01 +08:00
Jiaming Yuan
7bcc8b3e5c
Use batched copy if. (#6826) 2021-04-06 10:34:04 +08:00
Jiaming Yuan
a59c7323b4
Fix inplace predict missing value. (#6787) 2021-03-27 05:36:10 +08:00
ShvetsKS
8825670c9c
Memory consumption fix for row-major adapters (#6779)
Co-authored-by: Kirill Shvets <kirill.shvets@intel.com>
Co-authored-by: fis <jm.yuan@outlook.com>
2021-03-26 08:44:30 +08:00
Jiaming Yuan
a7083d3c13
Fix dart inplace prediction with GPU input. (#6777)
* Fix dart inplace predict with data on GPU, which might trigger a fatal check
for device access right.
* Avoid copying data whenever possible.
2021-03-25 12:00:32 +08:00
Jiaming Yuan
794fd6a46b
Support v3 cuda array interface. (#6776) 2021-03-25 09:58:09 +08:00
Jiaming Yuan
bcc0277338
Re-implement ROC-AUC. (#6747)
* Re-implement ROC-AUC.

* Binary
* MultiClass
* LTR
* Add documents.

This PR resolves a few issues:
  - Define a value when the dataset is invalid, which can happen if there's an
  empty dataset, or when the dataset contains only positive or negative values.
  - Define ROC-AUC for multi-class classification.
  - Define weighted average value for distributed setting.
  - A correct implementation for learning to rank task.  Previous
  implementation is just binary classification with averaging across groups,
  which doesn't measure ordered learning to rank.
2021-03-20 16:52:40 +08:00
Jiaming Yuan
4ee8340e79
Support column major array. (#6765) 2021-03-20 05:19:46 +08:00
Jiaming Yuan
f20074e826
Check for invalid data. (#6742) 2021-03-04 14:37:20 +08:00
Louis Desreumaux
9b530e5697
Improve OpenMP exception handling (#6680) 2021-02-25 13:56:16 +08:00
Jiaming Yuan
5d48d40d9a
Fix DMatrix slice with feature types. (#6689) 2021-02-09 08:13:51 +08:00
Jiaming Yuan
dbb5208a0a
Use __array_interface__ for creating DMatrix from CSR. (#6675)
* Use __array_interface__ for creating DMatrix from CSR.
* Add configuration.
2021-02-05 21:09:47 +08:00
Jiaming Yuan
1e949110da
Use generic dispatching routine for array interface. (#6672) 2021-02-05 09:23:38 +08:00
Jiaming Yuan
411592a347
Enhance inplace prediction. (#6653)
* Accept array interface for csr and array.
* Accept an optional proxy dmatrix for metainfo.

This constructs an explicit `_ProxyDMatrix` type in Python.

* Remove unused doc.
* Add strict output.
2021-02-02 11:41:46 +08:00
Jiaming Yuan
f2f7dd87b8
Use view for SparsePage exclusively. (#6590) 2021-01-11 18:04:55 +08:00
Jiaming Yuan
80065d571e
[dask] Add DaskXGBRanker (#6576)
* Initial support for distributed LTR using dask.

* Support `qid` in libxgboost.
* Refactor `predict` and `n_features_in_`, `best_[score/iteration/ntree_limit]`
  to avoid duplicated code.
* Define `DaskXGBRanker`.

The dask ranker doesn't support group structure, instead it uses query id and
convert to group ptr internally.
2021-01-08 18:35:09 +08:00
Jiaming Yuan
347f593169
Accept numpy array for DMatrix slice index. (#6368) 2020-12-16 14:42:52 +08:00
ShvetsKS
512b464cfa
Disable HT for DMatrix creation (#6386)
Co-authored-by: SHVETS, KIRILL <kirill.shvets@intel.com>
2020-11-14 22:18:33 +08:00
Jiaming Yuan
43efadea2e
Deterministic data partitioning for external memory (#6317)
* Make external memory data partitioning deterministic.

* Change the meaning of `page_size` from bytes to number of rows.

* Design a data pool.

* Note for external memory.

* Enable unity build on Windows CI.

* Force garbage collect on test.
2020-11-11 06:11:06 +08:00
Rory Mitchell
29745c6df2
Fix inclusive scan for large sizes (#6234) 2020-11-03 17:01:43 +13:00
Igor Moura
5e1e972aea
Clean up warnings (#6325) 2020-10-30 23:50:29 +08:00
Jiaming Yuan
b180223d18
Cleanup RABIT. (#6290)
* Remove recovery and MPI speed tests.
* Remove readme.
* Remove Python binding.
* Add checks in C API.
2020-10-27 08:48:22 +08:00
Igor Moura
d1254808d5
Clean up C++ warnings (#6213) 2020-10-19 23:02:33 +08:00
Jiaming Yuan
ddf37cca30
Unify thread configuration. (#6186) 2020-10-19 16:05:42 +08:00
Jiaming Yuan
bed7ae4083
Loop over thrust::reduce. (#6229)
* Check input chunk size of dqdm.
* Add doc for current limitation.
2020-10-14 10:40:56 +13:00
Rory Mitchell
734a911a26
Loop over copy_if (#6201)
* Loop over copy_if

* Catch OOM.

Co-authored-by: fis <jm.yuan@outlook.com>
2020-10-14 10:23:16 +13:00
Jiaming Yuan
2241563f23
Handle duplicated values in sketching. (#6178)
* Accumulate weights in duplicated values.
* Fix device id in iterative dmatrix.
2020-10-10 19:32:44 +08:00
Jiaming Yuan
14afdb4d92
Support categorical data in ellpack. (#6140) 2020-09-24 19:28:57 +08:00
Jiaming Yuan
210c131ce7
Support categorical data in GPU sketching. (#6137) 2020-09-21 13:53:06 +08:00
Jiaming Yuan
b5f52f0b1b
Validate weights are positive values. (#6115) 2020-09-15 09:03:55 +08:00
Jiaming Yuan
80c8547147
Make binary bin search reusable. (#6058)
* Move binary search row to hist util.
* Remove dead code.
2020-08-26 05:05:11 +08:00
Jiaming Yuan
20c95be625
Expand categorical node. (#6028)
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2020-08-25 18:53:57 +08:00