xgboost

Author	SHA1	Message	Date
Jiaming Yuan	83a66b4994	Support categorical data for hist. (#7695 ) * Extract partitioner from hist. * Implement categorical data support by passing the gradient index directly into the partitioner. * Organize/update document. * Remove code for negative hessian.	2022-02-25 03:47:14 +08:00
Jiaming Yuan	6762c45494	Small cleanup to gradient index and hist. (#7668 ) * Code comments. * Const accessor to index. * Remove some weird variables in the `Index` class. * Simplify the `MemStackAllocator`.	2022-02-23 11:37:21 +08:00
Jiaming Yuan	0d0abe1845	Support optimal partitioning for GPU hist. (#7652 ) * Implement `MaxCategory` in quantile. * Implement partition-based split for GPU evaluation. Currently, it's based on the existing evaluation function. * Extract an evaluator from GPU Hist to store the needed states. * Added some CUDA stream/event utilities. * Update document with references. * Fixed a bug in approx evaluator where the number of data points is less than the number of categories.	2022-02-15 03:03:12 +08:00
Jiaming Yuan	2369d55e9a	Add tests for prediction cache. (#7650 ) * Extract the test from approx for other tree methods. * Add note on how it works.	2022-02-15 00:28:00 +08:00
Jiaming Yuan	2775c2a1ab	Prepare external memory support for hist. (#7638 ) This PR prepares the GHistIndexMatrix to host the column matrix which is used by the hist tree method by accepting sparse_threshold parameter. Some cleanups are made to ensure the correct batch param is being passed into DMatrix along with some additional tests for correctness of SimpleDMatrix.	2022-02-10 16:58:02 +08:00
Jiaming Yuan	5d7818e75d	Remove `omp_get_max_threads` in tree updaters. (#7590 )	2022-01-26 19:55:47 +08:00
Jiaming Yuan	5817840858	Remove `omp_get_max_threads` in data. (#7588 )	2022-01-24 02:44:07 +08:00
Jiaming Yuan	465dc63833	Fix tree param feature type. (#7565 )	2022-01-16 04:46:29 +08:00
Jiaming Yuan	a1bcd33a3b	[breaking] Change internal model serialization to UBJSON. (#7556 ) * Use typed array for models. * Change the memory snapshot format. * Add new C API for saving to raw format.	2022-01-16 02:11:53 +08:00
Jiaming Yuan	001503186c	Rewrite approx (#7214 ) This PR rewrites the approx tree method to use codebase from hist for better performance and code sharing. The rewrite has many benefits: - Support for both `max_leaves` and `max_depth`. - Support for `grow_policy`. - Support for mono constraint. - Support for feature weights. - Support for easier bin configuration (`max_bin`). - Support for categorical data. - Faster performance for most of the datasets. (many times faster) - Support for prediction cache. - Significantly better performance for external memory. - Unites the code base between approx and hist.	2022-01-10 21:15:05 +08:00
Jiaming Yuan	7f399eac8b	Use double for GPU Hist node sum. (#7507 )	2021-12-22 08:41:35 +08:00
Jiaming Yuan	9ab73f737e	Extract Sketch Entry from hist maker. (#7503 ) * Extract Sketch Entry from hist maker. * Add a new sketch container for sorted inputs. * Optimize bin search.	2021-12-18 05:36:56 +08:00
Jiaming Yuan	eee527d264	Add approx partitioner. (#7467 )	2021-11-27 15:22:06 +08:00
Jiaming Yuan	bf7bb575b4	Test CPU histogram with cat data. (#7465 )	2021-11-27 00:43:28 +08:00
Jiaming Yuan	176110a22d	Support external memory in CPU histogram building. (#7372 )	2021-11-23 01:13:33 +08:00
Jiaming Yuan	9fb4338964	Add test for eta and mitigate float error. (#7446 ) * Add eta test. * Don't skip test.	2021-11-18 20:42:48 +08:00
Jiaming Yuan	d7d1b6e3a6	CPU evaluation for cat data. (#7393 ) * Implementation for one hot based. * Implementation for partition based. (LightGBM)	2021-11-06 14:41:35 +08:00
Jiaming Yuan	6ede12412c	Update dmlc-core and use data iter for GPU sampling tests. (#7398 ) * Update dmlc-core. * New parquet parser in dmlc-core. * Use data iter for GPU sampling tests.	2021-11-06 05:12:49 +08:00
Jiaming Yuan	b06040b6d0	Implement a general array view. (#7365 ) * Replace existing matrix and vector view. This is to prepare for handling higher dimension data and prediction when we support multi-target models.	2021-11-05 04:16:11 +08:00
Jiaming Yuan	4100827971	Pass information about objective to tree methods. (#7385 ) * Define the `ObjInfo` and pass it down to every tree updater.	2021-11-04 01:52:44 +08:00
Jiaming Yuan	ccdabe4512	Support building gradient index with cat data. (#7371 )	2021-11-03 22:37:37 +08:00
Jiaming Yuan	8d7c6366d7	Accept histogram cut instead of gradient index in evaluation. (#7336 )	2021-10-20 18:04:46 +08:00
Jiaming Yuan	8e619010d0	Extract CPUExpandEntry and HistParam. (#7321 ) * Remove kRootNid. * Check for empty hessian.	2021-10-17 14:22:25 +08:00
Jiaming Yuan	130df8cdda	Add tests for tree grow policy. (#7302 )	2021-10-12 15:04:06 +08:00
Jiaming Yuan	0ed979b096	Support more input types for categorical data. (#7220 ) * Support more input types for categorical data. * Shorten the type name from "categorical" to "c". * Tests for np/cp array and scipy csr/csc/coo. * Specify the type for feature info.	2021-09-16 20:39:30 +08:00
Jiaming Yuan	7a1d67f9cb	[breaking] Use integer atomic for GPU histogram. (#7180 ) On GPU we use a rounding factor to truncate the gradient for deterministic results. This PR changes the gradient representation to a fixed-point number with exponent aligned with the rounding factor. [breaking] Drop non-deterministic histogram. Use fixed point for shared memory. This PR is to improve the performance of GPU Hist. Co-authored-by: Andy Adinets <aadinets@nvidia.com>	2021-08-28 05:17:05 +08:00
Jiaming Yuan	149f209af6	Extract histogram builder from CPU Hist. (#7152 ) * Extract the CPU histogram builder. * Fix tests. * Reduce number of histograms being built.	2021-08-09 21:15:21 +08:00
Jiaming Yuan	bd1f3a38f0	Rewrite sparse dmatrix using callbacks. (#7092 ) - Reduce dependency on dmlc parsers and provide an interface for users to load data by themselves. - Remove use of threaded iterator and IO queue. - Remove `page_size`. - Make sure the number of pages in memory is bounded. - Make sure the cache can not be violated. - Provide an interface for internal algorithms to process data asynchronously.	2021-07-16 12:33:31 +08:00
Jiaming Yuan	615ab2b03e	Extract evaluate splits from CPU hist. (#7079 ) Other than modularizing the split evaluation function, this PR also removes some more functions including `InitNewNodes` and `BuildNodeStats` among some other unused variables. Also, scattered code like setting leaf weights is grouped into the split evaluator and `NodeEntry` is simplified and made private. Another subtle difference with the original implementation is that the modified code doesn't call `tree[nidx].Parent()` to traverse upward.	2021-07-07 15:16:25 +08:00
Jiaming Yuan	1cd20efe68	Move `GHistIndex` into `DMatrix`. (#7064 )	2021-07-01 00:44:49 +08:00
Jiaming Yuan	29f8fd6fee	Support categorical split in tree model dump. (#7036 )	2021-06-18 16:46:20 +08:00
ShvetsKS	2567404ab6	Simplify sparse and dense CPU hist kernels (#7029 ) * Simplify sparse and dense kernels * Extract row partitioner. Co-authored-by: Kirill Shvets <kirill.shvets@intel.com>	2021-06-11 18:26:30 +08:00
ShvetsKS	5cdaac00c1	Remove feature grouping (#7018 ) Co-authored-by: Kirill Shvets <kirill.shvets@intel.com>	2021-06-03 04:35:26 +08:00
ShvetsKS	57c732655e	Merge lossguide and depthwise strategies for CPU hist (#7007 ) * fix java/scala test: max depth is also valid parameter for lossguide Co-authored-by: Kirill Shvets <kirill.shvets@intel.com>	2021-06-03 01:49:43 +08:00
ShvetsKS	55b823b27d	Reduce 'InitSampling' complexity and set gradients to zero (#6922 ) Co-authored-by: Kirill Shvets <kirill.shvets@intel.com>	2021-05-29 04:52:23 +08:00
Jiaming Yuan	556a83022d	Implement unified update prediction cache for (gpu_)hist. (#6860 ) * Implement utilities for linalg. * Unify the update prediction cache functions. * Implement update prediction cache for multi-class gpu hist.	2021-04-17 00:29:34 +08:00
Jiaming Yuan	4f75f514ce	Fix GPU RF (#6755 ) * Fix sampling.	2021-03-17 06:23:35 +08:00
Igor Rukhovich	19a2c54265	Prediction by indices (subsample < 1) (#6683 ) * Another implementation of predicting by indices * Fixed omp parallel_for variable type * Removed SparsePageView from Updater	2021-03-16 15:08:20 +13:00
Jiaming Yuan	f2f7dd87b8	Use view for `SparsePage` exclusively. (#6590 )	2021-01-11 18:04:55 +08:00
ShvetsKS	956beead70	Thread local memory allocation for BuildHist (#6358 ) * thread mem locality * fix apply * cleanup * fix lint * fix tests * simple try * fix * fix * apply comments * fix comments * fix * apply simple comment Co-authored-by: ShvetsKS <kirill.shvets@intel.com>	2020-11-25 17:50:12 +03:00
Jiaming Yuan	d711d648cb	Fix label errors in graph visualization (#6369 )	2020-11-11 17:44:59 -08:00
Sergio Gavilán	b181a88f9f	Reduced some C++ compiler warnings (#6197 ) * Removed some warnings * Rebase with master * Solved C++ Google Tests errors made by refactoring in order to remove warnings * Undo renaming path -> path_ * Fix style check Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>	2020-10-29 12:36:00 -07:00
Jiaming Yuan	b5b24354b8	More categorical tests and disable shap sparse test. (#6219 ) * Fix tree load with 32 category.	2020-10-10 16:12:37 +08:00
Jiaming Yuan	444131a2e6	Add categorical data support to GPU Hist. (#6164 )	2020-09-29 11:27:25 +08:00
Jiaming Yuan	14afdb4d92	Support categorical data in ellpack. (#6140 )	2020-09-24 19:28:57 +08:00
Jiaming Yuan	2fcc4f2886	Unify evaluation functions. (#6037 )	2020-08-26 14:23:27 +08:00
Jiaming Yuan	20c95be625	Expand categorical node. (#6028 ) Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>	2020-08-25 18:53:57 +08:00
Jiaming Yuan	a144daf034	Limit tree depth for GPU hist. (#6045 )	2020-08-22 19:34:52 +08:00
Qi Zhang	989ddd036f	Swap byte-order in binary serializer to support big-endian arch (#5813 ) * fixed some endian issues * Use dmlc::ByteSwap() to simplify code * Fix lint check * [CI] Add test for s390x * Download latest CMake on s390x * Fix a bug in my code * Save magic number in dmatrix with byteswap on big-endian machine * Save version in binary with byteswap on big-endian machine * Load scalar with byteswap in MetaInfo * Add a debugging message * Handle arrays correctly when byteswapping * EOF can also be 255 * Handle magic number in MetaInfo carefully * Skip Tree.Load test for big-endian, since the test manually builds little-endian binary model * Handle missing packages in Python tests * Don't use boto3 in model compatibility tests * Add s390 Docker file for local testing * Add model compatibility tests * Add R compatibility test * Revert "Add R compatibility test" This reverts commit c2d2bdcb7dbae133cbb927fcd20f7e83ee2b18a8. Co-authored-by: Qi Zhang <q.zhang@ibm.com> Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>	2020-08-18 14:47:17 -07:00
Jiaming Yuan	4d99c58a5f	Feature weights (#5962 )	2020-08-18 19:55:41 +08:00

162 Commits