xgboost

Author	SHA1	Message	Date
Bobby Wang	03cc3b359c	[pyspark] support a list of feature column names (#8117 )	2022-08-08 17:05:27 +08:00
Jiaming Yuan	d87f69215e	Quantile DMatrix for CPU. (#8130 ) - Add a new `QuantileDMatrix` that works for both CPU and GPU. - Deprecate `DeviceQuantileDMatrix`.	2022-08-02 15:51:23 +08:00
Jiaming Yuan	546de5efd2	[pyspark] Cleanup data processing. (#8088 ) - Use numpy stack for handling list of arrays. - Reuse concat function from dask. - Prepare for `QuantileDMatrix`. - Remove unused code. - Use iterator for prediction to avoid initializing xgboost model	2022-07-26 15:00:52 +08:00
Jiaming Yuan	0ce80b7bcf	Mitigate flaky GPU test. (#8078 ) The flakiness is caused by the global random engine, which will take some time to fix.	2022-07-16 13:45:32 +08:00
Jiaming Yuan	7a5586f3db	Fix GPU quantile distributed test. (#8076 )	2022-07-16 11:40:53 +08:00
Jiaming Yuan	647d3844dd	Make test for categorical data deterministic. (#8080 )	2022-07-15 14:48:39 +08:00
WeichenXu	176fec8789	PySpark XGBoost integration (#8020 ) Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu> Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>	2022-07-13 13:11:18 +08:00
Rory Mitchell	794cbaa60a	Fuse split evaluation kernels (#8026 )	2022-07-05 10:24:31 +02:00
Jiaming Yuan	d285d6ba2a	Reduce regularization in GPU gblinear test. (#8010 )	2022-06-20 23:55:12 +08:00
Jiaming Yuan	b90c6d25e8	Implement `max_cat_threshold` for CPU. (#7957 )	2022-06-04 11:02:46 +08:00
Jiaming Yuan	bde4f25794	Handle missing categorical value in CPU evaluator. (#7948 )	2022-05-27 14:15:47 +08:00
Jiaming Yuan	474366c020	Add convergence test for sparse datasets. (#7922 )	2022-05-23 18:07:26 +08:00
Jiaming Yuan	94ca52b7b7	Fix overflow in prediction size. (#7885 )	2022-05-12 02:44:03 +08:00
Jiaming Yuan	46e0bce212	Use maximum category in sketch. (#7853 )	2022-05-05 19:56:49 +08:00
Rory Mitchell	90cce38236	Remove single_precision_histogram for gpu_hist (#7828 )	2022-05-03 14:53:19 +02:00
Jiaming Yuan	50d854e02e	[CI] Test with latest RAPIDS. (#7816 )	2022-04-30 11:55:10 -07:00
Jiaming Yuan	fdf533f2b9	[POC] Experimental support for l1 error. (#7812 ) Support adaptive tree, a feature supported by both sklearn and lightgbm. The tree leaf is recomputed based on residue of labels and predictions after construction. For l1 error, the optimal value is the median (50 percentile). This is marked as experimental support for the following reasons: - The value is not well defined for distributed training, where we might have empty leaves for local workers. Right now I just use the original leaf value for computing the average with other workers, which might cause significant errors. - Some follow-ups are required, for exact, pruner, and optimization for quantile function. Also, we need to calculate the initial estimation.	2022-04-26 21:41:55 +08:00
Jiaming Yuan	8b3ecfca25	Mitigate flaky tests. (#7749 ) * Skip non-increasing test with external memory when subsample is used. * Increase bin numbers for boost from prediction test. This mitigates the effect of non-deterministic partitioning.	2022-03-28 21:20:50 +08:00
Jiaming Yuan	18a4af63aa	Update documents and tests. (#7659 ) * Revise documents after recent refactoring and cat support. * Add tests for behavior of max_depth and max_leaves.	2022-02-26 03:57:47 +08:00
Jiaming Yuan	7366d3b20c	Ensure models with categorical splits don't use old binary format. (#7666 )	2022-02-19 08:05:28 +08:00
Jiaming Yuan	2369d55e9a	Add tests for prediction cache. (#7650 ) * Extract the test from approx for other tree methods. * Add note on how it works.	2022-02-15 00:28:00 +08:00
Jiaming Yuan	b52c4e13b0	[dask] Fix empty partition with pandas input. (#7644 ) Empty partition is different from empty dataset. For the former case, each worker has non-empty dask collections, but each collection might contain empty partition.	2022-02-14 19:35:51 +08:00
Philip Hyunsu Cho	c621775f34	Replace all uses of deprecated function sklearn.datasets.load_boston (#7373 ) * Replace all uses of deprecated function sklearn.datasets.load_boston * More renaming * Fix bad name * Update assertion * Fix n boosted rounds. * Avoid over regularization. * Rebase. * Avoid over regularization. * Whac-a-mole Co-authored-by: fis <jm.yuan@outlook.com>	2022-01-30 04:27:57 -08:00
Jiaming Yuan	ef4dae4c0e	[dask] Add scheduler address to dask config. (#7581 ) - Add user configuration. - Bring back the logic of using the scheduler address from dask. This was removed when we were trying to support GKE; now we bring it back and let xgboost try it if the direct guess or the host IP from user config fails.	2022-01-22 01:56:32 +08:00
Jiaming Yuan	cc06fab9a7	Support distributed CPU env for categorical data. (#7575 ) * Add support for cat data in sketch allreduce. * Share tests between CPU and GPU.	2022-01-18 21:56:07 +08:00
Jiaming Yuan	deab0e32ba	Validate out of range categorical value. (#7576 ) * Use float in CPU categorical set to preserve the input value. * Check out of range values.	2022-01-18 20:16:19 +08:00
Jiaming Yuan	d6ea5cc1ed	Cover approx tree method for categorical data tests. (#7569 ) * Add tree to df tests. * Add plotting tests. * Add histogram tests.	2022-01-16 11:31:40 +08:00
Jiaming Yuan	001503186c	Rewrite approx (#7214 ) This PR rewrites the approx tree method to use codebase from hist for better performance and code sharing. The rewrite has many benefits: - Support for both `max_leaves` and `max_depth`. - Support for `grow_policy`. - Support for mono constraint. - Support for feature weights. - Support for easier bin configuration (`max_bin`). - Support for categorical data. - Faster performance for most of the datasets. (many times faster) - Support for prediction cache. - Significantly better performance for external memory. - Unites the code base between approx and hist.	2022-01-10 21:15:05 +08:00
Jiaming Yuan	0df2ae63c7	Fix num_boosted_rounds for linear model. (#7538 ) * Add note. * Fix n boosted rounds.	2022-01-05 03:29:33 +08:00
Ginko Balboa	29bfa94bb6	Fix external memory with gpu_hist and subsampling combination bug. (#7481 ) Instead of accessing data from the `original_page_`, access the data from the first page of the available batch. fix #7476 Co-authored-by: jiamingy <jm.yuan@outlook.com>	2021-12-24 11:15:35 +08:00
Jiaming Yuan	58a6723eb1	Initial support for multioutput regression. (#7514 ) * Add num target model parameter, which is configured from input labels. * Change elementwise metric and indexing for weights. * Add demo. * Add tests.	2021-12-18 09:28:38 +08:00
Jiaming Yuan	70b12d898a	[dask] Fix ddqdm with empty partition. (#7510 ) * Fix empty partition. * war.	2021-12-16 20:37:29 +08:00
Jiaming Yuan	a55d43ccfd	Add test for invalid categorical data values. (#7380 ) * Add test for invalid categorical data values. * Add check during sketching.	2021-11-02 18:00:52 +08:00
Jiaming Yuan	a13321148a	Support multi-class with base margin. (#7381 ) This is already partially supported but never properly tested. So the only possible way to use it is calling `numpy.ndarray.flatten` with `base_margin` before passing it into XGBoost. This PR adds proper support for most of the data types along with tests.	2021-11-02 13:38:00 +08:00
Jiaming Yuan	3c4aa9b2ea	[breaking] Remove label encoder deprecated in 1.3. (#7357 )	2021-10-28 13:24:29 +08:00
Jiaming Yuan	ac9bfaa4f2	Handle missing values in dataframe with category dtype. (#7331 ) * Replace -1 in pandas initializer. * Unify `IsValid` functor. * Mimic pandas data handling in cuDF glue code. * Check invalid categories. * Fix DDM sketching.	2021-10-28 03:33:54 +08:00
Jiaming Yuan	d4349426d8	Re-implement PR-AUC. (#7297 ) * Support binary/multi-class classification, ranking. * Add documents. * Handle missing data.	2021-10-26 13:07:50 +08:00
Jiaming Yuan	f999897615	[dask] Use nthread in DMatrix construction. (#7337 ) This is consistent with the thread overriding behavior.	2021-10-20 15:16:40 +08:00
Jiaming Yuan	f56e2e9a66	Support categorical data with pandas Dataframe in inplace prediction (#7322 )	2021-10-17 14:32:06 +08:00
Jiaming Yuan	5b17bb0031	Fix prediction with cat data in sklearn interface. (#7306 ) * Specify DMatrix parameter for pre-processing dataframe. * Add document about the behaviour of prediction.	2021-10-12 14:31:12 +08:00
Jiaming Yuan	298af6f409	Fix weighted samples in multi-class AUC. (#7300 )	2021-10-11 15:12:29 +08:00
Jiaming Yuan	69d3b1b8b4	Remove old callback deprecated in 1.3. (#7280 )	2021-10-08 17:24:59 +08:00
Jiaming Yuan	0ed979b096	Support more input types for categorical data. (#7220 ) * Support more input types for categorical data. * Shorten the type name from "categorical" to "c". * Tests for np/cp array and scipy csr/csc/coo. * Specify the type for feature info.	2021-09-16 20:39:30 +08:00
Jiaming Yuan	037dd0820d	Implement `__sklearn_is_fitted__`. (#7230 )	2021-09-15 19:09:04 +08:00
Jiaming Yuan	d997c967d5	Demo for experimental categorical data support. (#7213 )	2021-09-15 08:20:12 +08:00
Jiaming Yuan	3a4f51f39f	Avoid calling CUDA code on CPU for linear model. (#7154 )	2021-09-01 10:45:31 +08:00
Jiaming Yuan	7a1d67f9cb	[breaking] Use integer atomic for GPU histogram. (#7180 ) On GPU we use rounding factor to truncate the gradient for deterministic results. This PR changes the gradient representation to fixed point number with exponent aligned with rounding factor. [breaking] Drop non-deterministic histogram. Use fixed point for shared memory. This PR is to improve the performance of GPU Hist. Co-authored-by: Andy Adinets <aadinets@nvidia.com>	2021-08-28 05:17:05 +08:00
Jiaming Yuan	7bdedacb54	Document for `process_type`. (#7135 ) * Update document for prune and refresh. * Add demo.	2021-08-03 13:11:52 +08:00
Jiaming Yuan	e88ac9cc54	[dask] Extend tree stats tests. (#7128 ) * Add tests to GPU. * Assert cover in children sums up to the parent.	2021-07-27 12:22:13 +08:00
Jiaming Yuan	e6088366df	Export Python Interface for external memory. (#7070 ) * Add Python iterator interface. * Add tests. * Add demo. * Add documents. * Handle empty dataset.	2021-07-22 15:15:53 +08:00
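
Note on commit fdf533f2b9 (experimental l1 error): the message states that, for l1 error, the optimal leaf value is the median of the residuals. The sketch below only illustrates that claim with the Python standard library; it is not XGBoost code. The helper `l1_loss` and the sample numbers are hypothetical, chosen for illustration.

```python
from statistics import median


def l1_loss(residuals, leaf_value):
    """Sum of absolute errors if every row in the leaf is shifted by leaf_value."""
    return sum(abs(r - leaf_value) for r in residuals)


# Hypothetical rows that fell into one leaf: labels and current predictions.
labels = [1.0, 2.0, 2.5, 4.0, 10.0]
predictions = [1.5, 1.5, 1.5, 1.5, 1.5]
residuals = [y - p for y, p in zip(labels, predictions)]

# Recompute the leaf value after tree construction as the median residual.
leaf_value = median(residuals)

# For comparison: the mean, which minimizes squared error rather than l1 error.
mean_value = sum(residuals) / len(residuals)

# Brute-force check over a coarse grid of candidate leaf values.
candidates = [i / 10 for i in range(-10, 100)]
best_on_grid = min(candidates, key=lambda c: l1_loss(residuals, c))

print("median leaf value:", leaf_value)
print("l1 loss at median:", l1_loss(residuals, leaf_value))
print("l1 loss at mean:  ", l1_loss(residuals, mean_value))
print("best grid value:  ", best_on_grid)
```

With an outlier in the leaf (the label 10.0), the mean is pulled upward while the median is not, which is why the two candidate leaf values differ here. The distributed case described in the commit message, where a local worker may have an empty leaf, is not covered by this sketch.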

185 Commits