Fix #3545: XGDMatrixCreateFromCSCEx silently discards empty trailing rows (#3553)

* Fix #3545: XGDMatrixCreateFromCSCEx silently discards empty trailing rows

Description: The bug is triggered when

1. The data matrix has empty rows at the bottom. More precisely, the rows
   `n-k+1`, `n-k+2`, ..., `n` of the matrix have missing values in all
   dimensions (`n` number of instances, `k` number of trailing rows)
2. The data matrix is given as Compressed Sparse Column (CSC) format.

Diagnosis: When the CSC matrix is converted to Compressed Sparse Row (CSR)
format (this is common format used for DMatrix), the trailing empty rows
are silently ignored. More specifically, the row pointer (`offset`) of the
newly created CSR matrix does not take account of these rows.

Fix: Modify the row pointer.

* Add regression test
This commit is contained in:
Philip Hyunsu Cho 2018-08-05 10:15:42 -07:00 committed by GitHub
parent 8c633d1ca3
commit 109473dae2
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 17 additions and 0 deletions

View File

@ -77,6 +77,18 @@ test_that("xgb.DMatrix: slice, dim", {
expect_equal(getinfo(dsub1, 'label'), getinfo(dsub2, 'label'))
})
test_that("xgb.DMatrix: slice, trailing empty rows", {
data(agaricus.train, package='xgboost')
train_data <- agaricus.train$data
train_label <- agaricus.train$label
dtrain <- xgb.DMatrix(data=train_data, label=train_label)
slice(dtrain, 6513L)
train_data[6513, ] <- 0
dtrain <- xgb.DMatrix(data=train_data, label=train_label)
slice(dtrain, 6513L)
expect_equal(nrow(dtrain), 6513)
})
test_that("xgb.DMatrix: colnames", {
dtest <- xgb.DMatrix(test_data, label=test_label)
expect_equal(colnames(dtest), colnames(test_data))

View File

@ -332,7 +332,12 @@ XGB_DLL int XGDMatrixCreateFromCSCEx(const size_t* col_ptr,
mat.info.num_row_ = mat.page_.offset.size() - 1;
if (num_row > 0) {
CHECK_LE(mat.info.num_row_, num_row);
// provision for empty rows at the bottom of matrix
for (uint64_t i = mat.info.num_row_; i < static_cast<uint64_t>(num_row); ++i) {
mat.page_.offset.push_back(mat.page_.offset.back());
}
mat.info.num_row_ = num_row;
CHECK_EQ(mat.info.num_row_, mat.page_.offset.size() - 1); // sanity check
}
mat.info.num_col_ = ncol;
mat.info.num_nonzero_ = nelem;