* Fix #3545: XGDMatrixCreateFromCSCEx silently discards empty trailing rows
Description: The bug is triggered when
1. The data matrix has empty rows at the bottom. More precisely, the rows
`n-k+1`, `n-k+2`, ..., `n` of the matrix have missing values in all
dimensions (`n` is the number of instances, `k` the number of empty trailing rows)
2. The data matrix is given in Compressed Sparse Column (CSC) format.
Diagnosis: When the CSC matrix is converted to Compressed Sparse Row (CSR)
format (the common format used for DMatrix), the trailing empty rows
are silently ignored. More specifically, the row pointer (`offset`) of the
newly created CSR matrix does not account for these rows.
Fix: Extend the row pointer so that it also covers the trailing empty rows.
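To illustrate the diagnosis, here is a minimal pure-Python sketch of a CSC-to-CSR conversion that keeps empty trailing rows. It is not the XGBoost C++ code: the function name `csc_to_csr` and its arguments are hypothetical, and the number of rows is assumed to be passed in explicitly rather than inferred from the largest row index seen.

```python
def csc_to_csr(col_ptr, row_idx, values, num_rows):
    """Convert CSC arrays to CSR arrays, keeping empty trailing rows."""
    # Count stored entries per row; rows without entries keep a count of 0.
    counts = [0] * num_rows
    for r in row_idx:
        counts[r] += 1

    # The row pointer has num_rows + 1 entries, so every row,
    # including an empty trailing one, gets its own offset.
    row_ptr = [0] * (num_rows + 1)
    for r in range(num_rows):
        row_ptr[r + 1] = row_ptr[r] + counts[r]

    # Scatter entries into their rows, walking the columns in order
    # so that column indices stay sorted within each row.
    next_slot = row_ptr[:-1].copy()
    col_idx = [0] * len(values)
    csr_values = [0.0] * len(values)
    for c in range(len(col_ptr) - 1):
        for k in range(col_ptr[c], col_ptr[c + 1]):
            r = row_idx[k]
            col_idx[next_slot[r]] = c
            csr_values[next_slot[r]] = values[k]
            next_slot[r] += 1
    return row_ptr, col_idx, csr_values


# A 4 x 2 matrix whose last two rows are empty:
# [[1, 0], [0, 2], [0, 0], [0, 0]]
row_ptr, col_idx, csr_values = csc_to_csr(
    col_ptr=[0, 1, 2], row_idx=[0, 1], values=[1.0, 2.0], num_rows=4
)
```

If the row count were derived from the largest row index instead, the row pointer would stop after row 1 and the two empty rows would be lost, which is the behavior described above.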
* Add regression test
The base margin will need to have length `[num_class] * [number of data points]`.
Otherwise, the array holding prediction results will be only partially
initialized, causing undefined behavior.
Fix: check the length of the base margin. If the length is not correct,
use the global bias (`base_score`) instead. Warn the user about the
substitution.
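The length check described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `resolve_base_margin` is a hypothetical helper, not a function of the XGBoost code base, and the fallback simply fills the whole margin with `base_score`.

```python
import warnings


def resolve_base_margin(base_margin, num_rows, num_class, base_score):
    """Return a margin of length num_class * num_rows (hypothetical helper)."""
    expected = num_class * num_rows
    if len(base_margin) == expected:
        # The user-supplied margin has the right length: use it as is.
        return list(base_margin)
    # Wrong length: fall back to the global bias and tell the user.
    warnings.warn(
        "base_margin has length %d, expected %d; using base_score instead"
        % (len(base_margin), expected)
    )
    return [base_score] * expected


# 3 data points, 2 classes: a margin of length 2 is rejected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    margin = resolve_base_margin([0.1, 0.2], num_rows=3, num_class=2, base_score=0.5)
```

Either way the returned list is fully initialized, so the prediction buffer never contains unset entries.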
* Fix #3402: wrong fid crashes distributed algorithm
The bug was introduced by the recent DMatrix refactor (#3301). It was partially
fixed by #3408 but the example in #3402 was still failing. The example in #3402
will succeed after this fix is applied.
* Explicitly specify "this" to prevent compile error
* Add regression test
* Add distributed test to Travis matrix
* Install kubernetes Python package as dependency of dmlc tracker
* Add Python dependencies
* Add compile step
* Reduce size of regression test case
* Further reduce size of test
* add back train method but mark as deprecated
* fix scalastyle error
* add new
* update doc
* finish Gang Scheduling
* more
* intro
* Add sections: Prediction, Model persistence and ML pipeline.
* Add XGBoost4j-Spark MLlib pipeline example
* partial finished version
* finish the doc
* adjust code
* fix the doc
* use rst
* Convert XGBoost4J-Spark tutorial to reST
* Bring XGBoost4J up to date
* add note about using hdfs
* remove duplicate file
* fix descriptions
* update doc
* Wrap HDFS/S3 export support as a note
* update
* wrap indexing_mode example in code block
This brings many goodies, including:
* Ability to specify delimiter and weight_column for CSV files:
```python
import xgboost
dtrain = xgboost.DMatrix('train.csv?format=csv&label_column=0&weight_column=1&delimiter= ')
```
* Ability to choose between 0-based and 1-based indexing for LIBSVM/LIBFM files:
```python
dtrain = xgboost.DMatrix('train.libsvm?indexing_mode=1') # use 1-based indexing
dtest = xgboost.DMatrix('test.libsvm') # use 0-based indexing (default)
dtest2 = xgboost.DMatrix('test2.libsvm?indexing_mode=-1') # use heuristic to detect 0-based / 1-based
```
* Fix a bug in float parsing (issue dmlc/dmlc-core#440)
* consider spark.task.cpus when controlling parallelism
* fix bug
* fix conf setup
* calculate requestedCores within ParallelismController
* enforce spark.task.cpus = 1
* unify unit test case framework
* enable spark ui
* consider missing value in prediction
* handle single prediction instance
* fix type conversion
* Fix bug caused by calling list(x) when x is a string
list('abcdcba') = ['a', 'b', 'c', 'd', 'c', 'b', 'a']
* Allow feature_names/feature_types to be of any type
If feature_names/feature_types is iterable, e.g. tuple, list, then convert the value to list, except for string; otherwise construct a list with a single value
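The conversion rule above can be sketched as a small helper. The name `as_list` is hypothetical and this is not the exact code in the Python package; it only shows the rule: strings are wrapped whole, other iterables are converted with `list`, and anything else becomes a single-element list.

```python
from collections.abc import Iterable


def as_list(value):
    """Normalize feature_names / feature_types to a list (hypothetical helper)."""
    if isinstance(value, str):
        # A string is iterable, but list('abc') would split it into characters.
        return [value]
    if isinstance(value, Iterable):
        # Tuples, lists and other iterables are converted element by element.
        return list(value)
    # A non-iterable value becomes a list with a single element.
    return [value]


names_from_str = as_list('abcdcba')
names_from_tuple = as_list(('f0', 'f1'))
names_from_scalar = as_list(3)
```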
* Delete excess whitespace
* Fix whitespace to pass lint
* Expand histogram memory dynamically to prevent large allocations for large tree depths (e.g. > 15)
* Remove GPU memory allocation messages. These are misleading as a large number of allocations are now dynamic.
* Fix appveyor R test
* Save max_delta_step as an extra attribute of Booster
Fixes #3509 and #3026, where the `max_delta_step` parameter gets lost during serialization.
* fix lint
* Use camel case for global constant
* disable local variable case in clang-tidy
* Added finding quantiles on GPU.
- this includes datasets where weights are assigned to data rows
- as the quantiles found by the new algorithm are not the same
as those found by the old one, test thresholds in
tests/python-gpu/test_gpu_updaters.py have been adjusted.
* Adjustments and improved testing for finding quantiles on the GPU.
- added C++ tests for the DeviceSketch() function
- reduced one of the thresholds in test_gpu_updaters.py
- adjusted the cuts found by the find_cuts_k kernel
Add `'total_gain'` and `'total_cover'` as possible `importance_type`
arguments to `Booster.get_score` in the Python package.
`get_score` already accepts a `'gain'` argument, which returns each
feature's average gain over all of its splits. `'total_gain'` does the
same, but returns a total rather than an average. This seems more
intuitively meaningful, and also matches the behavior of the R package's
`xgb.importance` function.
I also added an analogous `'total_cover'` option for consistency.
This should resolve #3484.
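The difference between the average and the total can be shown with a small arithmetic sketch. It does not call XGBoost; `gain_importance` and the `(feature, gain)` split records are hypothetical stand-ins for what the booster tracks internally.

```python
from collections import defaultdict


def gain_importance(splits):
    """Return (average gain, total gain) per feature from (feature, gain) pairs."""
    total = defaultdict(float)
    count = defaultdict(int)
    for feature, gain in splits:
        total[feature] += gain
        count[feature] += 1
    # 'gain' corresponds to the per-split average, 'total_gain' to the plain sum.
    average_gain = {f: total[f] / count[f] for f in total}
    return average_gain, dict(total)


# f0 is used in two splits, f1 in one.
average_gain, total_gain = gain_importance([("f0", 4.0), ("f0", 2.0), ("f1", 5.0)])
```

In this toy input, `f1` ranks first by average gain (5.0 versus 3.0) while `f0` ranks first by total gain (6.0 versus 5.0), so the two importance types can order features differently.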
* Improved library loading a bit
* Fixed indentation.
* Fixes according to the discussion
* Moved the comment to a separate line.
* specified exception type
* Change doc build to reST exclusively
* Rewrite Intro doc in reST; create toctree
* Update parameter and contribute
* Convert tutorials to reST
* Convert Python tutorials to reST
* Convert CLI and Julia docs to reST
* Enable markdown for R vignettes
* Done migrating to reST
* Add guzzle_sphinx_theme to requirements
* Add breathe to requirements
* Fix search bar
* Add link to user forum
* Fail GPU CI after test failure
* Fix GPU linear tests
* Reduced number of GPU tests to speed up CI
* Remove static allocations of device memory
* Resolve illegal memory access for updater_fast_hist.cc
* Fix broken r tests dependency
* Update python install documentation for GPU
* Upgrading to NCCL2
* Part II of the NCCL2 upgrade
- Doc updates to build with nccl2
- Dockerfile.gpu update for a correct CI build with nccl2
- Updated FindNccl package to have env-var NCCL_ROOT to take precedence
* Upgrading to v9.2 for CI workflow, since it has the nccl2 binaries available
* Added NCCL2 license + copy the nccl binaries into /usr location for the FindNccl module to find
* Set LD_LIBRARY_PATH variable to pick nccl2 binary at runtime
* Need the nccl2 library download instructions inside Dockerfile.release as well
* Use NCCL2 as a static library
* disable booster setup in spark
* check in parameter conversion
* fix compilation issue
* update exception type