* Fix #3730: scikit-learn 0.20 compatibility fix
sklearn.cross_validation has been removed from scikit-learn 0.20,
so replace it with sklearn.model_selection
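For code that must still run against scikit-learn releases older than 0.18, where only `sklearn.cross_validation` exists, a fallback import is one possible pattern. The sketch below is an illustration only: `import_first_available` is a hypothetical helper, not part of XGBoost or scikit-learn.

```python
import importlib


def import_first_available(*module_names):
    """Return the first module in module_names that can be imported.

    Hypothetical helper for illustration; not part of any library.
    """
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Module missing in this environment; try the next candidate.
            continue
    raise ImportError("none of these modules is available: %r" % (module_names,))
```

A caller would list the new location first, for example `import_first_available("sklearn.model_selection", "sklearn.cross_validation")`, and then read `KFold` or `train_test_split` from the returned module.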
* Display test names for Python tests for clarity
* Show inherited members of XGBRegressor in API doc, since XGBRegressor uses default methods from XGBModel
* Add table of contents to Python API doc
* Skip JVM doc download if not available
* Show inherited members for XGBRegressor
* Add docstring to XGBRegressor.predict()
* Fix rendering errors in Python docstrings
* Fix lint
This pull request amends the broken #3062 to allow Spark 2.2 to work.
Please note this won't work in Spark <=2.1 as sc.removeSparkListener was implemented in Spark 2.2. (So perhaps a more general method is better, although that is what was attempted in #3062)
This PR fixes: #3208, #3151 and the discussion in #1927.
I do find it strange that #3062 does not work in Spark 2.2. It is probably due to some sort of public/private issue in the org.apache.spark.scheduler.LiveListenerBus class inheritance (in Spark itself). The error is: `java.lang.NoSuchMethodError: org.apache.spark.scheduler.LiveListenerBus.removeListener(Ljava/lang/Object;)V`
* Adding Java/Scala doc build to Jenkins CI
* Deploy built doc to S3 bucket
* Build doc only for branches
* Build doc first, to get doc faster for branch updates
* Have ReadTheDocs download doc tarball from S3
* Update JVM doc links
* Put doc build commands in a script
* Specify Spark 2.3+ requirement for XGBoost4J-Spark
* Build GPU wheel without NCCL, to reduce binary size
* Revert "Fix #3485, #3540: Don't use dropout for predicting test sets (#3556)"
This reverts commit 44811f233071c5805d70c287abd22b155b732727.
* Document behavior of predict() for DART booster
* Add notice to parameter.rst
* add back train method but mark as deprecated
* fix scalastyle error
* partial finish
* no test
* add test cases
* address comments
* add test for regressor
* fix typo
* Fix #3545: XGDMatrixCreateFromCSCEx silently discards empty trailing rows
Description: The bug is triggered when
1. The data matrix has empty rows at the bottom. More precisely, the rows
`n-k+1`, `n-k+2`, ..., `n` of the matrix have missing values in all
dimensions (`n` is the number of instances, `k` the number of trailing empty rows).
2. The data matrix is given in Compressed Sparse Column (CSC) format.
Diagnosis: When the CSC matrix is converted to Compressed Sparse Row (CSR)
format (the common format used for DMatrix), the trailing empty rows
are silently ignored. More specifically, the row pointer (`offset`) of the
newly created CSR matrix does not account for these rows.
Fix: Modify the row pointer.
* Add regression test
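The CSC-to-CSR issue described above can be illustrated with a minimal pure-Python sketch. It is not the XGBoost implementation: the function name `csc_to_csr` and its arguments are hypothetical. It only shows why the row pointer needs `num_rows + 1` entries when the last rows are empty.

```python
def csc_to_csr(col_ptr, row_idx, values, num_rows):
    """Convert CSC arrays to CSR arrays, keeping empty trailing rows.

    Illustrative sketch; names and signature are assumptions.
    """
    # Count stored entries per row; empty rows keep a count of zero.
    counts = [0] * num_rows
    for r in row_idx:
        counts[r] += 1

    # The row pointer always has num_rows + 1 entries, so trailing empty
    # rows are represented by repeated offsets rather than being dropped.
    row_ptr = [0] * (num_rows + 1)
    for r in range(num_rows):
        row_ptr[r + 1] = row_ptr[r] + counts[r]

    col_idx = [0] * len(values)
    data = [0.0] * len(values)
    next_slot = row_ptr[:-1]  # next free position within each row
    for c in range(len(col_ptr) - 1):
        for k in range(col_ptr[c], col_ptr[c + 1]):
            r = row_idx[k]
            dst = next_slot[r]
            col_idx[dst] = c
            data[dst] = values[k]
            next_slot[r] += 1
    return row_ptr, col_idx, data
```

With a 4x2 matrix whose last two rows are empty (`col_ptr=[0, 2, 3]`, `row_idx=[0, 1, 1]`), the returned row pointer still has five entries, so the matrix keeps its four rows.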
The base margin will need to have length `[num_class] * [number of data points]`.
Otherwise, the array holding prediction results will be only partially
initialized, causing undefined behavior.
Fix: check the length of the base margin. If the length is not correct,
use the global bias (`base_score`) instead. Warn the user about the
substitution.
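The length check described above can be sketched as follows. This is a simplified illustration, not the actual C++ code: `resolve_initial_margin` and its arguments are hypothetical names, and the sketch uses `base_score` directly as the fill value.

```python
import warnings


def resolve_initial_margin(base_margin, num_class, num_rows, base_score):
    """Return one initial margin value per (data point, class) pair.

    Hypothetical helper: falls back to base_score when base_margin
    does not have exactly num_class * num_rows entries.
    """
    expected = num_class * num_rows
    if len(base_margin) == expected:
        return list(base_margin)
    if len(base_margin) > 0:
        # A margin was supplied but has the wrong length: warn, then ignore it.
        warnings.warn(
            "base_margin has length %d, expected %d; using base_score instead"
            % (len(base_margin), expected),
            UserWarning,
        )
    # Fully initialize the output so no entry is left undefined.
    return [base_score] * expected
```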
* Fix #3402: wrong fid crashes distributed algorithm
The bug was introduced by the recent DMatrix refactor (#3301). It was partially
fixed by #3408 but the example in #3402 was still failing. The example in #3402
will succeed after this fix is applied.
* Explicitly specify "this" to prevent compile error
* Add regression test
* Add distributed test to Travis matrix
* Install kubernetes Python package as dependency of dmlc tracker
* Add Python dependencies
* Add compile step
* Reduce size of regression test case
* Further reduce size of test
* add back train method but mark as deprecated
* fix scalastyle error
* add new
* update doc
* finish Gang Scheduling
* more
* intro
* Add sections: Prediction, Model persistence and ML pipeline.
* Add XGBoost4j-Spark MLlib pipeline example
* partial finished version
* finish the doc
* adjust code
* fix the doc
* use rst
* Convert XGBoost4J-Spark tutorial to reST
* Bring XGBoost4J up to date
* add note about using hdfs
* remove duplicate file
* fix descriptions
* update doc
* Wrap HDFS/S3 export support as a note
* update
* wrap indexing_mode example in code block
This brings many goodies, including:
* Ability to specify delimiter and weight_column for CSV files:
```python
dtrain = xgboost.DMatrix('train.csv?format=csv&label_column=0&weight_column=1&delimiter= ')
```
* Ability to choose between 0-based and 1-based indexing for LIBSVM/LIBFM files:
```python
dtrain = xgboost.DMatrix('train.libsvm?indexing_mode=1') # use 1-based indexing
dtest = xgboost.DMatrix('test.libsvm') # use 0-based indexing (default)
dtest2 = xgboost.DMatrix('test2.libsvm?indexing_mode=-1') # use heuristic to detect 0-based / 1-based
```
* Fix a bug in float parsing (issue dmlc/dmlc-core#440)
* add back train method but mark as deprecated
* fix scalastyle error
* consider spark.task.cpus when controlling parallelism
* fix bug
* fix conf setup
* calculate requestedCores within ParallelismController
* enforce spark.task.cpus = 1
* unify unit test case framework
* enable spark ui
* add back train method but mark as deprecated
* fix scalastyle error
* consider missing value in prediction
* handle single prediction instance
* fix type conversion
* Fix bug where list(x) splits a string x into single characters
list('abcdcba') == ['a', 'b', 'c', 'd', 'c', 'b', 'a']
* Allow feature_names/feature_types to be of any type
If feature_names/feature_types is iterable, e.g. tuple, list, then convert the value to list, except for string; otherwise construct a list with a single value
* Delete excess whitespace
* Fix whitespace to pass lint
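The feature_names/feature_types handling described above can be sketched as a small helper. This is an illustration under assumptions, not the code from the commit: `to_list` is a hypothetical name.

```python
def to_list(value):
    """Normalize value to a list without splitting strings into characters.

    Hypothetical helper illustrating the rule described above.
    """
    if isinstance(value, str):
        # list('abc') would give ['a', 'b', 'c']; wrap the string instead.
        return [value]
    try:
        # Tuples, lists and other iterables are converted element by element.
        return list(value)
    except TypeError:
        # Non-iterable values become a single-element list.
        return [value]
```

For example, `to_list('f0')` gives `['f0']`, while `to_list(('f0', 'f1'))` gives `['f0', 'f1']`.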