xgboost

Author	SHA1	Message	Date
Jiaming Yuan	97abcc7ee2	Extract interaction constraint from split evaluator. (#5034 ) * Extract interaction constraints from split evaluator. The reason for doing so is mostly for model IO, where num_feature and interaction_constraints are copied in split evaluator. Also interaction constraint by itself is a feature selector, acting like column sampler and it's inefficient to bury it deep in the evaluator chain. Lastly removing one another copied parameter is a win. * Enable inc for approx tree method. As now the implementation is split up from evaluator class, it's also enabled for approx method. * Removing obsoleted code in colmaker. They are never documented nor actually used in real world. Also there isn't a single test for those code blocks. * Unifying the types used for row and column. As the size of input dataset is marching to billion, incorrect use of int is subject to overflow, also signed integer overflow is undefined behaviour. This PR starts the procedure for unifying used index type to unsigned integers. There's optimization that can utilize this undefined behaviour, but after some testings I don't see the optimization is beneficial to XGBoost.	2019-11-14 20:11:41 +08:00
Jiaming Yuan	7663de956c	Run training with empty DMatrix. (#4990 ) This makes GPU Hist robust in distributed environment as some workers might not be associated with any data in either training or evaluation. * Disable rabit mock test for now: See #5012 . * Disable dask-cudf test at prediction for now: See #5003 * Launch dask job for all workers despite they might not have any data. * Check 0 rows in elementwise evaluation metrics. Using AUC and AUC-PR still throws an error. See #4663 for a robust fix. * Add tests for edge cases. * Add `LaunchKernel` wrapper handling zero sized grid. * Move some parts of allreducer into a cu file. * Don't validate feature names when the booster is empty. * Sync number of columns in DMatrix. As num_feature is required to be the same across all workers in data split mode. * Filtering in dask interface now by default syncs all booster that's not empty, instead of using rank 0. * Fix Jenkins' GPU tests. * Install dask-cuda from source in Jenkins' test. Now all tests are actually running. * Restore GPU Hist tree synchronization test. * Check UUID of running devices. The check is only performed on CUDA version >= 10.x, as 9.x doesn't have UUID field. * Fix CMake policy and project variables. Use xgboost_SOURCE_DIR uniformly, add policy for CMake >= 3.13. * Fix copying data to CPU * Fix race condition in cpu predictor. * Fix duplicated DMatrix construction. * Don't download extra nccl in CI script.	2019-11-06 16:13:13 +08:00
Jiaming Yuan	f24be2efb4	Use configure_file() to configure version only (#4974 ) * Avoid writing build_config.h * Remove build_config.h all together. * Lint.	2019-10-22 23:47:00 -07:00
Rong Ou	5b1715d97c	Write ELLPACK pages to disk (#4879 ) * add ellpack source * add batch param * extract function to parse cache info * construct ellpack info separately * push batch to ellpack page * write ellpack page. * make sparse page source reusable	2019-10-22 23:44:32 -04:00
Jiaming Yuan	5620322a48	[Breaking] Add global versioning. (#4936 ) * Use CMake config file for representing version. * Generate c and Python version file with CMake. The generated file is written into source tree. But unless XGBoost upgrades its version, there will be no actual modification. This retains compatibility with Makefiles for R. * Add XGBoost version the DMatrix binaries. * Simplify prefetch detection in CMakeLists.txt	2019-10-22 23:27:26 -04:00
Jiaming Yuan	7e477a2adb	Fix data loading (#4862 ) * Fix loading text data. * Fix config regex. * Try to explain the error better in exception. * Update doc.	2019-10-22 12:33:14 -04:00
Jiaming Yuan	095de3bf5f	Export c++ headers in CMake installation. (#4897 ) * Move get transpose into cc. * Clean up headers in host device vector, remove thrust dependency. * Move span and host device vector into public. * Install c++ headers. * Short notes for c and c++. Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>	2019-10-06 23:53:09 -04:00
Chen Qin	512f037e55	[rabit_bootstrap_cache ] failed xgb worker recover from other workers (#4808 ) * Better recovery support. Restarting only the failed workers.	2019-09-16 23:31:52 -04:00
Jiaming Yuan	9700776597	Cudf support. (#4745 ) * Initial support for cudf integration. * Add two C APIs for consuming data and metainfo. * Add CopyFrom for SimpleCSRSource as a generic function to consume the data. * Add FromDeviceColumnar for consuming device data. * Add new MetaInfo::SetInfo for consuming label, weight etc.	2019-08-19 16:51:40 +12:00
Rong Ou	19f9fd5de9	remove the qids_ field in MetaInfo (#4744 )	2019-08-08 10:01:59 +08:00
Rong Ou	6edddd7966	Refactor DMatrix to return batches of different page types (#4686 ) * Use explicit template parameter for specifying page type.	2019-08-03 15:10:34 -04:00
Jiaming Yuan	45876bf41b	Fix external memory for get column batches. (#4622 ) * Fix external memory for get column batches. This fixes two bugs: * Use PushCSC for get column batches. * Don't remove the created temporary directory before finishing test. * Check all pages.	2019-06-30 09:56:49 +08:00
sriramch	6e16900711	Fix crash with approx tree method on cpu (#4510 )	2019-05-30 01:11:29 +08:00
Rong Ou	81c1cd40ca	add a test for cpu predictor using external memory (#4308 ) * add a test for cpu predictor using external memory * allow different page size for testing	2019-04-10 13:25:10 +12:00
Jiaming Yuan	7ea5675679	Add PushCSC for SparsePage. (#4193 ) * Add PushCSC for SparsePage. * Move Push* definitions into cc file. * Add std:: prefix to `size_t` to make clang++ happy. * Address monitor count == 0.	2019-03-02 01:58:08 +08:00
Philip Hyunsu Cho	2aaae2e7bb	Fix #4163 : always copy sliced data (#4165 ) * Revert "Accept numpy array view. (#4147)" This reverts commit a985a99cf0dacb26a5d734835473d492d3c2a0df. * Fix #4163: always copy sliced data * Remove print() from the test; check shape equality * Check if 'base' attribute exists * Fix lint * Address reviewer comment * Fix lint	2019-02-20 14:46:34 -08:00
Jiaming Yuan	a985a99cf0	Accept numpy array view. (#4147 ) * Accept array view (slice) in metainfo.	2019-02-18 22:21:34 +08:00
Jiaming Yuan	2e618af743	Fix cpplint. (#4157 ) * Add comment after #endif. * Add missing headers.	2019-02-18 00:16:29 +08:00
Jiaming Yuan	e0a279114e	Unify logging facilities. (#3982 ) * Unify logging facilities. * Enhance `ConsoleLogger` to handle different verbosity. * Override macros from `dmlc`. * Don't use specialized gamma when building with GPU. * Remove verbosity cache in monitor. * Test monitor. * Deprecate `silent`. * Fix doc and messages. * Fix python test. * Fix silent tests.	2018-12-14 19:29:58 +08:00
Rory Mitchell	70d208d68c	Dmatrix refactor stage 2 (#3395 ) * DMatrix refactor 2 * Remove buffered rowset usage where possible * Transition to c++11 style iterators for row access * Transition column iterators to C++ 11	2018-10-01 01:29:03 +13:00
Andy Adinets	72cd1517d6	Replaced std::vector with HostDeviceVector in MetaInfo and SparsePage. (#3446 ) * Replaced std::vector with HostDeviceVector in MetaInfo and SparsePage. - added distributions to HostDeviceVector - using HostDeviceVector for labels, weights and base margins in MetaInfo - using HostDeviceVector for offset and data in SparsePage - other necessary refactoring * Added const version of HostDeviceVector API calls. - const versions added to calls that can trigger data transfers, e.g. DevicePointer() - updated the code that uses HostDeviceVector - objective functions now accept const HostDeviceVector<bst_float>& for predictions * Updated src/linear/updater_gpu_coordinate.cu. * Added read-only state for HostDeviceVector sync. - this means no copies are performed if both host and devices access the HostDeviceVector read-only * Fixed linter and test errors. - updated the lz4 plugin - added ConstDeviceSpan to HostDeviceVector - using device % dh::NVisibleDevices() for the physical device number, e.g. in calls to cudaSetDevice() * Fixed explicit template instantiation errors for HostDeviceVector. - replaced HostDeviceVector<unsigned int> with HostDeviceVector<int> * Fixed HostDeviceVector tests that require multiple GPUs. - added a mock set device handler; when set, it is called instead of cudaSetDevice()	2018-08-30 14:28:47 +12:00
liuliang01	0cf88d036f	Add qid like ranklib format (#2749 ) * add qid for https://github.com/dmlc/xgboost/issues/2748 * change names * change spaces * change qid to bst_uint type * change qid type to size_t * change qid first to SIZE_MAX * change qid type from size_t to uint64_t * update dmlc-core * fix qids name error * fix group_ptr_ error * Style fix * Add qid handling logic to SparsePage * New MetaInfo format + backward compatibility fix Old MetaInfo format (1.0) doesn't contain qid field. We still want to be able to read from MetaInfo files saved in old format. Also, define a new format (2.0) that contains the qid field. This way, we can distinguish files that contain qid and those that do not. * Update MetaInfo test * Simply group assignment logic * Explicitly set qid=nullptr in NativeDataIter NativeDataIter's callback does not support qid field. Users of NativeDataIter will need to call setGroup() function separately to set group information. * Save qids_ in SaveBinary() * Upgrade dmlc-core submodule * Add a test for reading qid * Add contributor * Check the size of qids_ * Document qid format	2018-06-30 20:24:03 +00:00
Rory Mitchell	a96039141a	Dmatrix refactor stage 1 (#3301 ) * Use sparse page as singular CSR matrix representation * Simplify dmatrix methods * Reduce statefulness of batch iterators * BREAKING CHANGE: Remove prob_buffer_row parameter. Users are instead recommended to sample their dataset as a preprocessing step before using XGBoost.	2018-06-07 10:25:58 +12:00
Rory Mitchell	ccf80703ef	Clang-tidy static analysis (#3222 ) * Clang-tidy static analysis * Modernise checks * Google coding standard checks * Identifier renaming according to Google style	2018-04-19 18:57:13 +12:00
EvanChong	790da458e7	Sync number of features after loaded matrix in different workers. (#2722 )	2017-11-29 11:19:12 -08:00
Xiaoguang Sun	2ae56ca84f	Use int32_t explicitly when serializing version (#2389 ) Use int32_t explicitly when serializing version field of dmatrix in binary format. On ILP64 architectures, although very little, size of int is 64 bits.	2017-06-07 10:03:42 -07:00
AbdealiJK	b045ccd764	data.cc: Remove redundant ftype variable	2016-12-04 11:25:57 -08:00
AbdealiJK	6f16f0ef58	Use bst_float consistently throughout (#1824 ) * Fix various typos * Add override to functions that are overridden gcc gives warnings about functions that are being overridden but not being marked as overridden. This fixes it. * Use bst_float consistently Use bst_float for all the variables that involve weight, leaf value, gradient, hessian, gain, loss_chg, predictions, base_margin, feature values. In some cases, when due to additions and so on the value can take a larger value, double is used. This ensures that type conversions are minimal and reduces loss of precision.	2016-11-30 10:02:10 -08:00
yuanbowen	5898f1c59e	[DATA] fix instance weights loading	2016-05-23 18:40:41 +08:00
tqchen	ecb3a271be	[PYTHON-DIST] Distributed xgboost python training API.	2016-02-29 16:54:13 -08:00
tqchen	413f119c7e	Update dmlc-core	2016-02-10 13:11:21 -08:00
tqchen	63c4ad7617	[APPROX] Make global proposal default, add group ptr solution	2016-02-10 11:19:10 -08:00
tqchen	ce4d59ed69	[TREE] Enable global proposal for faster speed	2016-02-10 11:19:10 -08:00
Ubuntu	c36195795a	increase shard	2016-02-10 11:17:18 -08:00
Ubuntu	46be6181b5	[DIST] fix distributed setting	2016-02-10 11:17:18 -08:00
tqchen	b27b51f60e	[PLUGIN] Add densify parser	2016-02-10 11:17:18 -08:00
tqchen	634db18a0f	[TRAVIS] cleanup travis script	2016-01-16 10:25:12 -08:00
tqchen	fd173e260f	[FIX] change evaluation to more precision	2016-01-16 10:25:12 -08:00
tqchen	67fbf8d264	[TEST] add partial load option	2016-01-16 10:25:12 -08:00
tqchen	6de1c86d18	[LZ4] enable 16 bit index	2016-01-16 10:25:11 -08:00
tqchen	96f4542a67	[PLUGIN] Add plugin system	2016-01-16 10:25:11 -08:00
tqchen	36c389ac46	[DATA] Isolate the format of page file	2016-01-16 10:25:11 -08:00
tqchen	2dc6c2dc52	[R] enable R compile [R] Enable R build for windows and linux	2016-01-16 10:24:02 -08:00
tqchen	72347e2d45	[DATA] Make it fully compatible with rank	2016-01-16 10:24:01 -08:00
tqchen	ef1021e759	[IO] Enable external memory	2016-01-16 10:24:01 -08:00
tqchen	d75e3ed05d	[LIBXGBOOST] pass demo running.	2016-01-16 10:24:01 -08:00
tqchen	dedd87662b	[OBJ] Add basic objective function and registry	2016-01-16 10:24:01 -08:00
tqchen	46bcba7173	[DATA] basic data refactor done, basic version of csr source.	2016-01-16 10:24:00 -08:00
tqchen	3d708e4788	latest data	2016-01-16 10:24:00 -08:00

49 Commits