Commit Graph

253 Commits

Author SHA1 Message Date
Jiaming Yuan
3fd331f8f2 Add checks to C pointer arguments. (#8254) 2022-09-22 19:02:22 +08:00
Jiaming Yuan
fffb1fca52 Calculate base_score based on input labels for mae. (#8107)
Fit an intercept as base score for abs loss.
2022-09-20 20:53:54 +08:00
Jiaming Yuan
bdf265076d Make QuantileDMatrix default to sklearn esitmators. (#8220) 2022-09-13 13:52:19 +08:00
Jiaming Yuan
441ffc017a Copy data from Ellpack to GHist. (#8215) 2022-09-06 23:05:49 +08:00
Jiaming Yuan
16bca5d4a1 Support CPU input for device QuantileDMatrix. (#8136)
- Copy `GHistIndexMatrix` to `Ellpack` when needed.
2022-08-11 21:21:26 +08:00
Jiaming Yuan
446d536c23 Fix loading DMatrix binary in distributed env. (#8149)
- Try to load DMatrix binary before trying to parse text input.
- Remove some unmaintained code.
2022-08-10 22:53:16 +08:00
Jiaming Yuan
2c70751d1e Implement iterative DMatrix for CPU. (#8116) 2022-07-26 22:34:21 +08:00
Jiaming Yuan
4a4e5c7c18 Prepare gradient index for Quantile DMatrix. (#8103)
* Prepare gradient index for Quantile DMatrix.

- Implement push batch with adapter batch.
- Implement `GetFvalue` for prediction.
2022-07-22 17:26:33 +08:00
Jiaming Yuan
4083440690 Small cleanups to various data types. (#8086)
- Use `bst_bin_t` in batch param constructor.
- Use `StringView` to avoid `std::string` when appropriate.
- Avoid using `MetaInfo` in quantile constructor to limit the scope of parameter.
2022-07-18 22:39:36 +08:00
Jiaming Yuan
8dd96013f1 Split up column matrix initialization. (#8060)
* Split up column matrix initialization.

This PR splits the column matrix initialization into 2 steps, the first one initializes
the storage while the second one does the transpose. By doing so, we can reuse the code
for Quantile DMatrix.
2022-07-14 10:34:47 +08:00
Jiaming Yuan
8746f9cddf Rename IterativeDMatrix. (#8045) 2022-07-04 18:52:31 +08:00
Jiaming Yuan
f0c1b842bf Implement sketching with adapter. (#8019) 2022-06-23 00:03:02 +08:00
Jiaming Yuan
142a208a90 Fix compiler warnings. (#8022)
- Remove/fix unused parameters
- Remove deprecated code in rabit.
- Update dmlc-core.
2022-06-22 21:29:10 +08:00
Jiaming Yuan
1a33b50a0d Fix compiler warnings. (#7974)
- Remove unused parameters. There are still many warnings that are not yet
addressed. Currently, the warnings in dmlc-core dominate the error log.
- Remove `distributed` parameter from metric.
- Fixes some warnings about signed comparison.
2022-06-06 22:56:25 +08:00
Jiaming Yuan
d48123d23b Fix rmm build (#7973)
- Optionally switch to c++17
- Use rmm CMake target.
- Workaround compiler errors.
- Fix GPUMetric inheritance.
- Run death tests even if it's built with RMM support.

Co-authored-by: jakirkham <jakirkham@gmail.com>
2022-06-06 20:18:32 +08:00
Jiaming Yuan
18a38f7ca0 Refactor for GHistIndex. (#7923)
* Pass sparse page as adapter, which prepares for quantile dmatrix.
* Remove old external memory code like `rbegin` and extra `Init` function.
* Simplify type dispatch.
2022-05-23 23:04:53 +08:00
Jiaming Yuan
765097d514 Simplify inplace-predict. (#7910)
Pass the `X` as part of Proxy DMatrix instead of an independent `dmlc::any`.
2022-05-18 17:52:00 +08:00
Jiaming Yuan
19775ffe15 Use adapter to initialize column matrix. (#7912) 2022-05-18 16:15:12 +08:00
Jiaming Yuan
11d65fcb21 Extract partial sum into an independent function. (#7889) 2022-05-13 14:30:35 +08:00
Jiaming Yuan
288c52596c Define bin type. (#7850) 2022-04-29 19:41:39 +08:00
Jiaming Yuan
fdf533f2b9 [POC] Experimental support for l1 error. (#7812)
Support adaptive tree, a feature supported by both sklearn and lightgbm.  The tree leaf is recomputed based on residue of labels and predictions after construction.

For l1 error, the optimal value is the median (50 percentile).

This is marked as experimental support for the following reasons:
- The value is not well defined for distributed training, where we might have empty leaves for local workers. Right now I just use the original leaf value for computing the average with other workers, which might cause significant errors.
- Some follow-ups are required, for exact, pruner, and optimization for quantile function. Also, we need to calculate the initial estimation.
2022-04-26 21:41:55 +08:00
Jiaming Yuan
522636cb52 Bump version. (#7769) 2022-03-31 06:33:22 +08:00
Jiaming Yuan
64575591d8 Use context in SetInfo. (#7687)
* Use the name `Context`.
* Pass a context object into `SetInfo`.
* Add context to proxy matrix.
* Add context to iterative DMatrix.

This is to remove the use of the default number of threads during `SetInfo` as a follow-up on
removing the global omp variable while preparing for CUDA stream semantic.  Currently, XGBoost
uses the legacy CUDA stream, we will gradually remove them in the future in favor of non-blocking streams.
2022-03-24 22:16:26 +08:00
Jiaming Yuan
4d81c741e9 External memory support for hist (#7531)
* Generate column matrix from gHistIndex.
* Avoid synchronization with the sparse page once the cache is written.
* Cleanups: Remove member variables/functions, change the update routine to look like approx and gpu_hist.
* Remove pruner.
2022-03-22 00:13:20 +08:00
Jiaming Yuan
e78a38b837 Sort sparse page index when constructing DMatrix. (#7731) 2022-03-16 18:01:05 +08:00
Xiaochang Wu
613ec36c5a Support building SimpleDMatrix from Arrow data format (#7512)
* Integrate with Arrow C data API.
* Support Arrow dataset.
* Support Arrow table.

Co-authored-by: Xiaochang Wu <xiaochang.wu@intel.com>
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
Co-authored-by: Zhang Zhang <zhang.zhang@intel.com>
2022-03-15 13:25:19 +08:00
Jiaming Yuan
98d6faefd6 Implement slope for Pseduo-Huber. (#7727)
* Add objective and metric.
* Some refactoring for CPU/GPU dispatching using linalg module.
2022-03-14 21:42:38 +08:00
Jiaming Yuan
6762c45494 Small cleanup to gradient index and hist. (#7668)
* Code comments.
* Const accessor to index.
* Remove some weird variables in the `Index` class.
* Simplify the `MemStackAllocator`.
2022-02-23 11:37:21 +08:00
Jiaming Yuan
2775c2a1ab Prepare external memory support for hist. (#7638)
This PR prepares the GHistIndexMatrix to host the column matrix which is used by the hist tree method by accepting sparse_threshold parameter.

Some cleanups are made to ensure the correct batch param is being passed into DMatrix along with some additional tests for correctness of SimpleDMatrix.
2022-02-10 16:58:02 +08:00
Jiaming Yuan
81210420c6 Remove omp_get_max_threads (#7608)
This is the one last PR for removing omp global variable.

* Add context object to the `DMatrix`.  This bridges `DMatrix` with https://github.com/dmlc/xgboost/issues/7308 .
* Require context to be available at the construction time of booster.
* Add `n_threads` support for R csc DMatrix constructor.
* Remove `omp_get_max_threads` in R glue code.
* Remove threading utilities that rely on omp global variable.
2022-01-28 16:09:22 +08:00
Jiaming Yuan
e060519d4f Avoid regenerating the gradient index for approx. (#7591) 2022-01-26 21:41:30 +08:00
Jiaming Yuan
5d7818e75d Remove omp_get_max_threads in tree updaters. (#7590) 2022-01-26 19:55:47 +08:00
Jiaming Yuan
5817840858 Remove omp_get_max_threads in data. (#7588) 2022-01-24 02:44:07 +08:00
Jiaming Yuan
2db808021d Silent some warnings for unused variable. (#7548) 2022-01-11 01:16:26 +08:00
Jiaming Yuan
58a6723eb1 Initial support for multioutput regression. (#7514)
* Add num target model parameter, which is configured from input labels.
* Change elementwise metric and indexing for weights.
* Add demo.
* Add tests.
2021-12-18 09:28:38 +08:00
Jiaming Yuan
9ab73f737e Extract Sketch Entry from hist maker. (#7503)
* Extract Sketch Entry from hist maker.

* Add a new sketch container for sorted inputs.
* Optimize bin search.
2021-12-18 05:36:56 +08:00
Jiaming Yuan
5b1161bb64 Convert labels into tensor. (#7456)
* Add a new ctor to tensor for `initilizer_list`.
* Change labels from host device vector to tensor.
* Rename the field from `labels_` to `labels` since it's a public member.
2021-12-17 00:58:35 +08:00
Jiaming Yuan
70b12d898a [dask] Fix ddqdm with empty partition. (#7510)
* Fix empty partition.

* war.
2021-12-16 20:37:29 +08:00
Jiaming Yuan
85cbd32c5a Add range-based slicing to tensor view. (#7453) 2021-11-27 13:42:36 +08:00
Jiaming Yuan
557ffc4bf5 Reduce base margin to 2 dim for now. (#7455) 2021-11-27 00:46:13 +08:00
Jiaming Yuan
5262e933f7 Remove unnecessary constexpr. (#7466) 2021-11-23 16:42:08 +08:00
Jiaming Yuan
d33854af1b [Breaking] Accept multi-dim meta info. (#7405)
This PR changes base_margin into a 3-dim array, with one of them being reserved for multi-target classification. Also, a breaking change is made for binary serialization due to extra dimension along with a fix for saving the feature weights. Lastly, it unifies the prediction initialization between CPU and GPU. After this PR, the meta info setter in Python will be based on array interface.
2021-11-18 23:02:54 +08:00
Jiaming Yuan
b0015fda96 Fix R CRAN failures. (#7404)
* Remove hist builder dtor.

* Initialize values.

* Tolerance.

* Remove the use of nthread in col maker.
2021-11-16 10:51:12 +08:00
Jiaming Yuan
55ee272ea8 Extend array interface to handle ndarray. (#7434)
* Extend array interface to handle ndarray.

The `ArrayInterface` class is extended to support multi-dim array inputs. Previously this
class handles only 2-dim (vector is also matrix).  This PR specifies the expected
dimension at compile-time and the array interface can perform various checks automatically
for input data. Also, adapters like CSR are more rigorous about their input.  Lastly, row
vector and column vector are handled without intervention from the caller.
2021-11-16 09:52:15 +08:00
Jiaming Yuan
d4274bc556 Fix typo. (#7433) 2021-11-15 01:28:11 +08:00
Jiaming Yuan
6ede12412c Update dmlc-core and use data iter for GPU sampling tests. (#7398)
* Update dmlc-core.
* New parquet parser in dmlc-core.
* Use data iter for GPU sampling tests.
2021-11-06 05:12:49 +08:00
Jiaming Yuan
ccdabe4512 Support building gradient index with cat data. (#7371) 2021-11-03 22:37:37 +08:00
Jiaming Yuan
a13321148a Support multi-class with base margin. (#7381)
This is already partially supported but never properly tested. So the only possible way to use it is calling `numpy.ndarray.flatten` with `base_margin` before passing it into XGBoost. This PR adds proper support
for most of the data types along with tests.
2021-11-02 13:38:00 +08:00
Jiaming Yuan
ac9bfaa4f2 Handle missing values in dataframe with category dtype. (#7331)
* Replace -1 in pandas initializer.
* Unify `IsValid` functor.
* Mimic pandas data handling in cuDF glue code.
* Check invalid categories.
* Fix DDM sketching.
2021-10-28 03:33:54 +08:00
Jiaming Yuan
d1f00fb0b7 Stricter validation for group. (#7345) 2021-10-21 12:13:33 +08:00