xgboost

Author	SHA1	Message	Date
Jiaming Yuan	647d3844dd	Make test for categorical data deterministic. (#8080 )	2022-07-15 14:48:39 +08:00
Jiaming Yuan	dae7a41baa	Update Python requirement to >=3.8. (#8071 ) Additional changes: - Use mamba for CPU test on Jenkins. - Cleanup CPU test dependencies. - Restore some of the modin tests	2022-07-14 18:01:47 +08:00
WeichenXu	176fec8789	PySpark XGBoost integration (#8020 ) Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu> Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>	2022-07-13 13:11:18 +08:00
Jiaming Yuan	8959622836	[dask] Use an invalid port for test. (#8064 )	2022-07-13 11:59:02 +08:00
Jiaming Yuan	4a87ea49b8	Reduce regularization for CPU gblinear. (#8013 )	2022-06-21 01:05:27 +08:00
Jiaming Yuan	b90c6d25e8	Implement `max_cat_threshold` for CPU. (#7957 )	2022-06-04 11:02:46 +08:00
Jiaming Yuan	13b15e07e8	Handle formatted JSON input. (#7953 )	2022-06-01 16:20:58 +08:00
Jiaming Yuan	bde4f25794	Handle missing categorical value in CPU evaluator. (#7948 )	2022-05-27 14:15:47 +08:00
Jiaming Yuan	18cbebaeb9	Unify the cat split storage for CPU. (#7937 ) * Unify the cat split storage for CPU. * Cleanup. * Workaround.	2022-05-26 04:14:40 -07:00
Jiaming Yuan	606be9e663	Handle missing values in one hot splits. (#7934 )	2022-05-24 20:48:41 +08:00
Jiaming Yuan	474366c020	Add convergence test for sparse datasets. (#7922 )	2022-05-23 18:07:26 +08:00
Jiaming Yuan	f93a727869	Address remaining mypy errors in python package. (#7914 )	2022-05-18 22:46:15 +08:00
Rong Ou	77d4a53c32	use RabitContext intead of init/finalize (#7911 )	2022-05-17 12:15:41 +08:00
Jiaming Yuan	db80671d6b	Fix monotone constraint with tuple input. (#7891 )	2022-05-13 04:00:03 +08:00
Jiaming Yuan	46e0bce212	Use maximum category in sketch. (#7853 )	2022-05-05 19:56:49 +08:00
Jiaming Yuan	fdf533f2b9	[POC] Experimental support for l1 error. (#7812 ) Support adaptive tree, a feature supported by both sklearn and lightgbm. The tree leaf is recomputed based on residue of labels and predictions after construction. For l1 error, the optimal value is the median (50 percentile). This is marked as experimental support for the following reasons: - The value is not well defined for distributed training, where we might have empty leaves for local workers. Right now I just use the original leaf value for computing the average with other workers, which might cause significant errors. - Some follow-ups are required, for exact, pruner, and optimization for quantile function. Also, we need to calculate the initial estimation.	2022-04-26 21:41:55 +08:00
Jiaming Yuan	332380479b	Avoid warning in np primitive type tests. (#7833 )	2022-04-23 02:07:01 +08:00
Jiaming Yuan	c70fa502a5	Expose `feature_types` to sklearn interface. (#7821 )	2022-04-21 20:23:35 +08:00
Jiaming Yuan	52d4eda786	Deprecate `use_label_encoder` in XGBClassifier. (#7822 ) * Deprecate `use_label_encoder` in XGBClassifier. * We have removed the encoder, now prepare to remove the indicator.	2022-04-21 13:14:02 +08:00
Jiaming Yuan	9150fdbd4d	Support pandas nullable types. (#7760 )	2022-03-30 08:51:52 +08:00
Jiaming Yuan	a50b84244e	Cleanup configuration for constraints. (#7758 )	2022-03-29 04:22:46 +08:00
Jiaming Yuan	3c9b04460a	Move `num_parallel_tree` to model parameter. (#7751 ) The size of forest should be a property of model itself instead of a training hyper-parameter.	2022-03-29 02:32:42 +08:00
Jiaming Yuan	8b3ecfca25	Mitigate flaky tests. (#7749 ) * Skip non-increasing test with external memory when subsample is used. * Increase bin numbers for boost from prediction test. This mitigates the effect of non-deterministic partitioning.	2022-03-28 21:20:50 +08:00
Jiaming Yuan	4d81c741e9	External memory support for hist (#7531 ) * Generate column matrix from gHistIndex. * Avoid synchronization with the sparse page once the cache is written. * Cleanups: Remove member variables/functions, change the update routine to look like approx and gpu_hist. * Remove pruner.	2022-03-22 00:13:20 +08:00
Xiaochang Wu	613ec36c5a	Support building SimpleDMatrix from Arrow data format (#7512 ) * Integrate with Arrow C data API. * Support Arrow dataset. * Support Arrow table. Co-authored-by: Xiaochang Wu <xiaochang.wu@intel.com> Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com> Co-authored-by: Zhang Zhang <zhang.zhang@intel.com>	2022-03-15 13:25:19 +08:00
Jiaming Yuan	a62a3d991d	[dask] prediction with categorical data. (#7708 )	2022-03-10 00:21:48 +08:00
Cheng Li	a92e0f6240	multi groups in the constraints (#7711 )	2022-03-01 18:10:15 +08:00
Jiaming Yuan	83a66b4994	Support categorical data for hist. (#7695 ) * Extract partitioner from hist. * Implement categorical data support by passing the gradient index directly into the partitioner. * Organize/update document. * Remove code for negative hessian.	2022-02-25 03:47:14 +08:00
Jiaming Yuan	f08c5dcb06	Cleanup some pylint errors. (#7667 ) * Cleanup some pylint errors. * Cleanup pylint errors in rabit modules. * Make data iter an abstract class and cleanup private access. * Cleanup no-self-use for booster.	2022-02-19 18:53:12 +08:00
Jiaming Yuan	7366d3b20c	Ensure models with categorical splits don't use old binary format. (#7666 )	2022-02-19 08:05:28 +08:00
Jiaming Yuan	0d0abe1845	Support optimal partitioning for GPU hist. (#7652 ) * Implement `MaxCategory` in quantile. * Implement partition-based split for GPU evaluation. Currently, it's based on the existing evaluation function. * Extract an evaluator from GPU Hist to store the needed states. * Added some CUDA stream/event utilities. * Update document with references. * Fixed a bug in approx evaluator where the number of data points is less than the number of categories.	2022-02-15 03:03:12 +08:00
Jiaming Yuan	5cd1f71b51	[dask] Improve configuration for port. (#7645 ) - Try port 0 to let the OS return the available port. - Add port configuration.	2022-02-14 21:34:34 +08:00
Jiaming Yuan	3e693e4f97	[dask] Fix nthread config with dask sklearn wrapper. (#7633 )	2022-02-08 06:38:32 +08:00
Philip Hyunsu Cho	c621775f34	Replace all uses of deprecated function sklearn.datasets.load_boston (#7373 ) * Replace all uses of deprecated function sklearn.datasets.load_boston * More renaming * Fix bad name * Update assertion * Fix n boosted rounds. * Avoid over regularization. * Rebase. * Avoid over regularization. * Whac-a-mole Co-authored-by: fis <jm.yuan@outlook.com>	2022-01-30 04:27:57 -08:00
Philip Hyunsu Cho	b4340abf56	Add special handling for multi:softmax in sklearn predict (#7607 ) * Add special handling for multi:softmax in sklearn predict * Add test coverage	2022-01-29 15:54:49 -08:00
Jiaming Yuan	24789429fd	Support latest pandas Index type. (#7595 )	2022-01-26 18:20:10 +08:00
Jiaming Yuan	ef4dae4c0e	[dask] Add scheduler address to dask config. (#7581 ) - Add user configuration. - Bring back to the logic of using scheduler address from dask. This was removed when we were trying to support GKE, now we bring it back and let xgboost try it if direct guess or host IP from user config failed.	2022-01-22 01:56:32 +08:00
Jiaming Yuan	5ddd4a9d06	Small cleanup to tests. (#7585 ) * Use random port in dask tests to avoid warnings for occupied port. * Increase the difficulty of AUC tests.	2022-01-21 06:26:57 +00:00
Jiaming Yuan	dac9eb13bd	Implement new `save_raw` in Python. (#7572 ) * Expose the new C API function to Python. * Remove old document and helper script. * Small optimization to the `save_raw` and Json ctors.	2022-01-19 02:27:51 +08:00
Jiaming Yuan	cc06fab9a7	Support distributed CPU env for categorical data. (#7575 ) * Add support for cat data in sketch allreduce. * Share tests between CPU and GPU.	2022-01-18 21:56:07 +08:00
Jiaming Yuan	deab0e32ba	Validate out of range categorical value. (#7576 ) * Use float in CPU categorical set to preserve the input value. * Check out of range values.	2022-01-18 20:16:19 +08:00
Jiaming Yuan	d6ea5cc1ed	Cover approx tree method for categorical data tests. (#7569 ) * Add tree to df tests. * Add plotting tests. * Add histogram tests.	2022-01-16 11:31:40 +08:00
Jiaming Yuan	a1bcd33a3b	[breaking] Change internal model serialization to UBJSON. (#7556 ) * Use typed array for models. * Change the memory snapshot format. * Add new C API for saving to raw format.	2022-01-16 02:11:53 +08:00
Jiaming Yuan	13b0fa4b97	Implement `get_group`. (#7564 )	2022-01-16 02:07:42 +08:00
Philip Hyunsu Cho	20c0d60ac7	Restore functionality of max_depth=0 in hist (#7551 ) * Restore functionality of max_depth=0 in hist * Add test case	2022-01-11 01:37:44 +08:00
Jiaming Yuan	001503186c	Rewrite approx (#7214 ) This PR rewrites the approx tree method to use codebase from hist for better performance and code sharing. The rewrite has many benefits: - Support for both `max_leaves` and `max_depth`. - Support for `grow_policy`. - Support for mono constraint. - Support for feature weights. - Support for easier bin configuration (`max_bin`). - Support for categorical data. - Faster performance for most of the datasets. (many times faster) - Support for prediction cache. - Significantly better performance for external memory. - Unites the code base between approx and hist.	2022-01-10 21:15:05 +08:00
Jiaming Yuan	0df2ae63c7	Fix num_boosted_rounds for linear model. (#7538 ) * Add note. * Fix n boosted rounds.	2022-01-05 03:29:33 +08:00
Jiaming Yuan	eb1efb54b5	Define `feature_names_in_`. (#7526 ) * Define `feature_names_in_`. * Raise attribute error if it's not defined.	2022-01-05 01:35:34 +08:00
Jiaming Yuan	8f0a42a266	Initial support for multi-label classification. (#7521 ) * Add support in sklearn classifier.	2022-01-04 23:58:21 +08:00
Ginko Balboa	29bfa94bb6	Fix external memory with gpu_hist and subsampling combination bug. (#7481 ) Instead of accessing data from the `original_page_`, access the data from the first page of the available batch. fix #7476 Co-authored-by: jiamingy <jm.yuan@outlook.com>	2021-12-24 11:15:35 +08:00

1 2 3 4 5 ...

545 Commits