This pull request amends the broken #3062 to allow Spark 2.2 to work.
Please note this won't work in Spark <=2.1 as sc.removeSparkListener was implemented in Spark 2.2. (So perhaps a more general method is better, although that is what was attempted in #3062)
This PR fixes: #3208, #3151 and the discussion in #1927.
I do find it strange that #3062 does not work in Spark 2.2; it's probably due to some sort of public/private issue in the org.apache.spark.scheduler.LiveListenerBus class inheritance (in Spark itself). The error is: `java.lang.NoSuchMethodError: org.apache.spark.scheduler.LiveListenerBus.removeListener(Ljava/lang/Object;)V`
* Adding Java/Scala doc build to Jenkins CI
* Deploy built doc to S3 bucket
* Build doc only for branches
* Build doc first, to get doc faster for branch updates
* Have ReadTheDocs download doc tarball from S3
* Update JVM doc links
* Put doc build commands in a script
* Specify Spark 2.3+ requirement for XGBoost4J-Spark
* Build GPU wheel without NCCL, to reduce binary size
* Revert "Fix #3485, #3540: Don't use dropout for predicting test sets (#3556)"
This reverts commit 44811f2330.
* Document behavior of predict() for DART booster
* Add notice to parameter.rst
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* partial finish
* no test
* add test cases
* add test cases
* address comments
* add test for regressor
* fix typo
* Fix #3545: XGDMatrixCreateFromCSCEx silently discards empty trailing rows
Description: The bug is triggered when
1. The data matrix has empty rows at the bottom. More precisely, the rows
`n-k+1`, `n-k+2`, ..., `n` of the matrix have missing values in all
dimensions (`n` = number of instances, `k` = number of trailing rows)
2. The data matrix is given as Compressed Sparse Column (CSC) format.
Diagnosis: When the CSC matrix is converted to Compressed Sparse Row (CSR)
format (the common format used for DMatrix), the trailing empty rows
are silently ignored. More specifically, the row pointer (`offset`) of the
newly created CSR matrix does not account for these rows.
Fix: Modify the row pointer.
* Add regression test
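For reference, a minimal sketch (not from the PR) of the trigger from the Python side, assuming the wrapper routes scipy CSC input through `XGDMatrixCreateFromCSCEx`:
```python
import numpy as np
import scipy.sparse
import xgboost

# 4x3 CSC matrix whose last two rows contain no stored values at all
data = np.array([1.0, 2.0])
row = np.array([0, 1])
col = np.array([0, 2])
mat = scipy.sparse.csc_matrix((data, (row, col)), shape=(4, 3))

dtrain = xgboost.DMatrix(mat, label=np.array([0.0, 1.0, 0.0, 1.0]))
# Before the fix, the two trailing empty rows were silently dropped,
# so num_row() reported 2 instead of 4.
print(dtrain.num_row())
```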
The base margin will need to have length `[num_class] * [number of data points]`.
Otherwise, the array holding prediction results will be only partially
initialized, causing undefined behavior.
Fix: check the length of the base margin. If the length is not correct,
use the global bias (`base_score`) instead. Warn the user about the
substitution.
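A hedged illustration (not from the PR) of supplying a correctly sized base margin for a multiclass model, assuming the Python-side setter is `set_base_margin`:
```python
import numpy as np
import xgboost

num_class, num_rows = 3, 120
X = np.random.rand(num_rows, 5)
y = np.random.randint(0, num_class, size=num_rows)

dtrain = xgboost.DMatrix(X, label=y)
# One margin value per (row, class) pair; a shorter array now triggers the
# warning described above and falls back to the global base_score.
dtrain.set_base_margin(np.zeros(num_class * num_rows))

bst = xgboost.train({'objective': 'multi:softprob', 'num_class': num_class},
                    dtrain, num_boost_round=2)
```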
* Fix #3402: wrong fid crashes distributed algorithm
The bug was introduced by the recent DMatrix refactor (#3301). It was partially
fixed by #3408 but the example in #3402 was still failing. The example in #3402
will succeed after this fix is applied.
* Explicitly specify "this" to prevent compile error
* Add regression test
* Add distributed test to Travis matrix
* Install kubernetes Python package as dependency of dmlc tracker
* Add Python dependencies
* Add compile step
* Reduce size of regression test case
* Further reduce size of test
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* add new
* update doc
* finish Gang Scheduling
* more
* intro
* Add sections: Prediction, Model persistence and ML pipeline.
* Add XGBoost4j-Spark MLlib pipeline example
* partial finished version
* finish the doc
* adjust code
* fix the doc
* use rst
* Convert XGBoost4J-Spark tutorial to reST
* Bring XGBoost4J up to date
* add note about using hdfs
* remove duplicate file
* fix descriptions
* update doc
* Wrap HDFS/S3 export support as a note
* update
* wrap indexing_mode example in code block
This brings many goodies, including:
* Ability to specify delimiter and weight_column for CSV files:
```python
dtrain = xgboost.DMatrix('train.csv?format=csv&label_column=0&weight_column=1&delimiter= ')
```
* Ability to choose between 0-based and 1-based indexing for LIBSVM/LIBFM files:
```python
dtrain = xgboost.DMatrix('train.libsvm?indexing_mode=1') # use 1-based indexing
dtest = xgboost.DMatrix('test.libsvm') # use 0-based indexing (default)
dtest2 = xgboost.DMatrix('test2.libsvm?indexing_mode=-1') # use heuristic to detect 0-based / 1-based
```
* Fix a bug in float parsing (issue dmlc/dmlc-core#440)
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* consider spark.task.cpus when controlling parallelism
* fix bug
* fix conf setup
* calculate requestedCores within ParallelismController
* enforce spark.task.cpus = 1
* unify unit test case framework
* enable spark ui
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* consider missing value in prediction
* handle single prediction instance
* fix type conversion
* Fix bug when using the list(x) function with x being a string
list('abcdcba') = ['a', 'b', 'c', 'd', 'c', 'b', 'a']
* Allow feature_names/feature_types to be of any type
If feature_names/feature_types is an iterable other than a string (e.g. a tuple or list), convert the value to a list; otherwise construct a list with a single value
* Delete excess whitespace
* Fix whitespace to pass lint
* Expand histogram memory dynamically to prevent large allocations for large tree depths (e.g. > 15)
* Remove GPU memory allocation messages. These are misleading as a large number of allocations are now dynamic.
* Fix appveyor R test
* Save max_delta_step as an extra attribute of Booster
Fixes #3509 and #3026, where the `max_delta_step` parameter gets lost during serialization.
* fix lint
* Use camel case for global constant
* disable local variable case in clang-tidy
* Added finding quantiles on GPU.
- this includes datasets where weights are assigned to data rows
- as the quantiles found by the new algorithm are not the same
as those found by the old one, test thresholds in
tests/python-gpu/test_gpu_updaters.py have been adjusted.
* Adjustments and improved testing for finding quantiles on the GPU.
- added C++ tests for the DeviceSketch() function
- reduced one of the thresholds in test_gpu_updaters.py
- adjusted the cuts found by the find_cuts_k kernel
Add `'total_gain'` and `'total_cover'` as possible `importance_type`
arguments to `Booster.get_score` in the Python package.
`get_score` already accepts a `'gain'` argument, which returns each
feature's average gain over all of its splits. `'total_gain'` does the
same, but returns a total rather than an average. This seems more
intuitively meaningful, and also matches the behavior of the R package's
`xgb.importance` function.
I also added an analogous `'total_cover'` option for consistency.
This should resolve #3484.
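A small usage sketch (data and parameter values illustrative) of the new importance types alongside the existing `'gain'`:
```python
import numpy as np
import xgboost

X = np.random.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
dtrain = xgboost.DMatrix(X, label=y, feature_names=['f0', 'f1', 'f2', 'f3'])
bst = xgboost.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=10)

print(bst.get_score(importance_type='gain'))         # average gain per split
print(bst.get_score(importance_type='total_gain'))   # gain summed over all splits
print(bst.get_score(importance_type='total_cover'))  # cover summed over all splits
```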
* Improved library loading a bit
* Fixed indentation.
* Fixes according to the discussion
* Moved the comment to a separate line.
* specified exception type
* Change doc build to reST exclusively
* Rewrite Intro doc in reST; create toctree
* Update parameter and contribute
* Convert tutorials to reST
* Convert Python tutorials to reST
* Convert CLI and Julia docs to reST
* Enable markdown for R vignettes
* Done migrating to reST
* Add guzzle_sphinx_theme to requirements
* Add breathe to requirements
* Fix search bar
* Add link to user forum
* Fail GPU CI after test failure
* Fix GPU linear tests
* Reduced number of GPU tests to speed up CI
* Remove static allocations of device memory
* Resolve illegal memory access for updater_fast_hist.cc
* Fix broken r tests dependency
* Update python install documentation for GPU
* Upgrading to NCCL2
* Part II of the NCCL2 upgrade
- Doc updates to build with nccl2
- Dockerfile.gpu update for a correct CI build with nccl2
- Updated the FindNccl package so that the env-var NCCL_ROOT takes precedence
* Upgrading to v9.2 for CI workflow, since it has the nccl2 binaries available
* Added NCCL2 license + copy the nccl binaries into /usr location for the FindNccl module to find
* Set LD_LIBRARY_PATH variable to pick nccl2 binary at runtime
* Need the nccl2 library download instructions inside Dockerfile.release as well
* Use NCCL2 as a static library
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* disable booster setup in spark
* check in parameter conversion
* fix compilation issue
* update exception type
* add qid for https://github.com/dmlc/xgboost/issues/2748
* change names
* change spaces
* change qid to bst_uint type
* change qid type to size_t
* change qid first to SIZE_MAX
* change qid type from size_t to uint64_t
* update dmlc-core
* fix qids name error
* fix group_ptr_ error
* Style fix
* Add qid handling logic to SparsePage
* New MetaInfo format + backward compatibility fix
Old MetaInfo format (1.0) doesn't contain qid field. We still want to be able
to read from MetaInfo files saved in old format. Also, define a new format
(2.0) that contains the qid field. This way, we can distinguish files that
contain qid and those that do not.
* Update MetaInfo test
* Simplify group assignment logic
* Explicitly set qid=nullptr in NativeDataIter
NativeDataIter's callback does not support qid field. Users of NativeDataIter
will need to call setGroup() function separately to set group information.
* Save qids_ in SaveBinary()
* Upgrade dmlc-core submodule
* Add a test for reading qid
* Add contributor
* Check the size of qids_
* Document qid format
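A short sketch (file name hypothetical) of the two ways to supply group information mentioned above:
```python
import xgboost

# 'rank.txt' is a hypothetical LIBSVM-format file annotated with qid, e.g.:
#   1 qid:1 1:0.9 3:0.2
#   0 qid:1 2:0.4
#   0 qid:2 1:0.3 4:0.1
#   1 qid:2 3:0.7
dtrain = xgboost.DMatrix('rank.txt')   # group boundaries inferred from the qid column

# Callbacks such as NativeDataIter do not carry qid, so group sizes
# must then be set explicitly:
# dtrain.set_group([2, 2])
```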
* allow arbitrary cross validation fold indices
- use training indices passed to `folds` parameter in `training.cv`
- update doc string
* add tests for arbitrary fold indices
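A minimal sketch, assuming the `folds` parameter accepts a list of (train, test) index pairs as described above:
```python
import numpy as np
import xgboost
from sklearn.model_selection import KFold

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)
dtrain = xgboost.DMatrix(X, label=y)

# Arbitrary user-defined fold indices instead of the built-in splitter
folds = list(KFold(n_splits=4, shuffle=True, random_state=0).split(X))
res = xgboost.cv({'objective': 'binary:logistic'}, dtrain,
                 num_boost_round=10, folds=folds)
print(res)
```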
* Refactor to allow for custom regularisation methods
* Implement compositional SplitEvaluator framework
* Fixed segfault when no monotone_constraints are supplied.
* Change pid to parentID
* test_monotone_constraints.py now passes
* Refactor ColMaker and DistColMaker to use SplitEvaluator
* Performance optimisation when no monotone_constraints specified
* Fix linter messages
* Fix a few more linter errors
* Update the amalgamation
* Add bounds check
* Add check for leaf node
* Fix linter error in param.h
* Fix clang-tidy errors on CI
* Fix incorrect function name
* Fix clang-tidy error in updater_fast_hist.cc
* Enable SSE2 for Win32 R MinGW
Addresses https://github.com/dmlc/xgboost/pull/3335#issuecomment-400535752
* Add contributor
CI tests were failing because wget prompts the user for a response
whenever the Google Test archive is already on disk.
Fix: use the `-nc` (no-clobber) option to skip the download when the archive
already exists.
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* maven central release
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* [jvm-packages] XGBoost Spark integration refactor. (#3313)
* XGBoost Spark integration refactor.
* Make corresponding update for xgboost4j-example
* Address comments.
* [jvm-packages] Refactor XGBoost-Spark params to make it compatible with both XGBoost and Spark MLLib (#3326)
* Refactor XGBoost-Spark params to make it compatible with both XGBoost and Spark MLLib
* Fix extra space.
* [jvm-packages] XGBoost Spark supports ranking with group data. (#3369)
* XGBoost Spark supports ranking with group data.
* Use Iterator.duplicate to prevent OOM.
* Update CheckpointManagerSuite.scala
* Resolve conflicts
* Use sparse page as singular CSR matrix representation
* Simplify dmatrix methods
* Reduce statefulness of batch iterators
* BREAKING CHANGE: Remove prob_buffer_row parameter. Users are instead recommended to sample their dataset as a preprocessing step before using XGBoost.
* GPU binning and compression.
- binning and index compression are done inside the DeviceShard constructor
- in case of a DMatrix with multiple row batches, it is first converted into a single row batch
Currently, `CLIPredict()` saves prediction results with the default 6-digit precision, which causes precision loss. This PR raises the precision so that the conversion back to `bst_float` is lossless.
Related: #3298.
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* update 0.80
* Fix print.xgb.Booster
valid_handle should be TRUE when x$handle is NOT null
* Update xgb.Booster.R
Modify is.null.handle to return TRUE for NULL handle
* Add option to use weights when evaluating metrics in validation sets
* Add test for validation-set weights functionality
* simplify case with no weights for test sets
* fix lint issues
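An illustrative sketch (not from the PR): instance weights attached to a validation DMatrix are now reflected in the evaluation metric:
```python
import numpy as np
import xgboost

X_tr, y_tr = np.random.rand(200, 3), np.random.randint(0, 2, size=200)
X_va, y_va = np.random.rand(50, 3), np.random.randint(0, 2, size=50)
w_va = np.random.rand(50)  # per-instance weights for the validation set

dtrain = xgboost.DMatrix(X_tr, label=y_tr)
dvalid = xgboost.DMatrix(X_va, label=y_va, weight=w_va)

# 'valid-error' below is now computed as a weighted metric
xgboost.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=5,
              evals=[(dvalid, 'valid')])
```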
* For CRAN submission, remove all #pragmas that suppress compiler warnings
A few headers in dmlc-core contain #pragmas that disable compiler warnings,
which is against the CRAN submission policy. Fix the problem by removing
the offending #pragmas as part of the command `make Rbuild`.
This addresses issue #3322.
* Fix script to improve Cygwin/MSYS compatibility
We need this to pass rmingw CI test
* Remove remove_warning_suppression_pragma.sh from packaged tarball
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* static glibc glibc++
* update to build with glibc 2.12
* remove unsupported flags
* update version number
* remove properties
* remove unnecessary command
* update poms
* Update dmlc-core submodule
* Fix dense_parser to work with the latest dmlc-core
* Specify location of Google Test
* Add more source files in dmlc-minimum to get latest dmlc-core working
* Update dmlc-core submodule
* Adjust xgboost entries in .gitignore
They were overly broad. In particular, this was inconvenient when
working with tools such as fzf that use the .gitignore to decide what to
include. As written, we'd not look into /include/xgboost.
* Make cosmetic improvements to .gitignore
* Remove dmlc-core from .gitignore
This seems unnecessary and has the drawback that tools that use
.gitignore to decide which files to skip won't look here, even though being
able to inspect the submodule files with them is useful.
* Increase precision of bst_float values in tree dumps
* Increase precision of bst_float values in tree dumps
* Fix lint error and switch precision to right float variable
* Fix clang-tidy error
* Multi-GPU HostDeviceVector.
- HostDeviceVector instances can now span multiple devices, defined by GPUSet struct
- the interface of HostDeviceVector has been modified accordingly
- GPU objective functions are now multi-GPU
- GPU predicting from cache is now multi-GPU
- avoiding omp_set_num_threads() calls
- other minor changes
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* change version of jvm to keep consistent with other pkgs
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* update default spark version to 2.3
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* add back libsvm notes
* Now `make pippack` works without any manual action: it will produce
xgboost-[version].tar.gz, which one can use by typing
`pip3 install xgboost-[version].tar.gz`.
* Detect OpenMP-capable compilers (clang, gcc-5, gcc-7) on MacOS
* rank_metric: add AUC-PR
Implementation of the AUC-PR calculation for weighted data, proposed by Keilwagen, Grosse and Grau (https://doi.org/10.1371/journal.pone.0092209)
* rank_metric: fix lint warnings
* Implement tests for AUC-PR and fix implementation
* add aucpr to documentation for other languages
* fix rebase conflict
* [core] additional gblinear improvements
* [R] callback for gblinear coefficients history
* force eta=1 for gblinear python tests
* add top_k to GreedyFeatureSelector
* set eta=1 in shotgun test
* [core] fix SparsePage processing in gblinear; col-wise multithreading in greedy updater
* set sorted flag within TryInitColData
* gblinear tests: use scale, add external memory test
* fix multiclass for greedy updater
* fix whitespace
* fix typo
* Extended monotonic constraints support to 'hist' tree method.
* Added monotonic constraints tests.
* Fix the signature of NoConstraint::CalcSplitGain()
* Document monotonic constraint support in 'hist'
* Update signature of Update to account for latest refactor
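A usage sketch (parameter values illustrative) combining a monotonic constraint with `tree_method='hist'`:
```python
import numpy as np
import xgboost

X = np.random.rand(500, 2)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + np.random.normal(scale=0.1, size=500)
dtrain = xgboost.DMatrix(X, label=y)

params = {
    'tree_method': 'hist',             # monotonic constraints now honoured here too
    'monotone_constraints': '(1,-1)',  # feature 0 non-decreasing, feature 1 non-increasing
    'max_depth': 4,
}
bst = xgboost.train(params, dtrain, num_boost_round=20)
```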
* Support CSV file in DMatrix
We'd just need to expose the CSV parser in dmlc-core to the Python wrapper
* Revert extra code; document existing CSV support
CSV support is already there but undocumented
* Add notice about categorical features
* Replaced std::vector-based interfaces with HostDeviceVector-based interfaces.
- replacement was performed in the learner, boosters, predictors,
updaters, and objective functions
- only interfaces used in training were replaced;
interfaces like PredictInstance() still use std::vector
- refactoring necessary for replacement of interfaces was also performed,
such as using HostDeviceVector in prediction cache
* HostDeviceVector-based interfaces for custom objective function example plugin.
* Fix doc build
ReadTheDocs build has been broken for a while due to incompatibilities between
commonmark, recommonmark, and sphinx. See:
* "Recommonmark not working with Sphinx 1.6"
https://github.com/rtfd/recommonmark/issues/73
* "CommonMark 0.6.0 breaks compatibility"
https://github.com/rtfd/recommonmark/issues/24
For now, we fix the versions to get the build working again
* Fix search bar
* added MinGW-w64 installation instructions and library file copy step.
* Change all `libxgboost.dll` to `xgboost.dll`
On Windows, the library file is called `xgboost.dll`, not `libxgboost.dll` as the build doc previously stated.
In line 461, `size_t offset = 0;` should be declared before any calculation; otherwise it causes a compilation error:
```
I:\Libraries\xgboost\src\c_api\c_api.cc(416): error C2146: Missing ";" before "offset" [I:\Libraries\xgboost\build\objxgboost.vcxproj]
```
* Add interaction effects and cox loss
* Minimize whitespace changes
* Cox loss now no longer needs a pre-sorted dataset.
* Address code review comments
* Remove mem check, rename to pred_interactions, include bias
* Make lint happy
* More lint fixes
* Fix cox loss indexing
* Fix main effects and tests
* Fix lint
* Use half interaction values on the off-diagonals
* Fix lint again
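A brief sketch (illustrative) of querying the new interaction effects from the Python API, assuming the flag name `pred_interactions` mentioned above:
```python
import numpy as np
import xgboost

X = np.random.rand(80, 3)
y = np.random.rand(80)
dtrain = xgboost.DMatrix(X, label=y)
bst = xgboost.train({'max_depth': 3}, dtrain, num_boost_round=5)

# Shape: (n_samples, n_features + 1, n_features + 1); the extra row/column
# holds the bias, and off-diagonal entries carry half of each interaction value.
inter = bst.predict(dtrain, pred_interactions=True)
print(inter.shape)
```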
- thrust::copy() called from dvec::copy() for gpairs invoked a GPU kernel instead of
cudaMemcpy()
- this resulted in illegal memory access if the GPU running the kernel could not access
the data being copied
- new version of dvec::copy() for thrust::device_ptr iterators calls cudaMemcpy(),
avoiding the problem.
* Added GPU objective function and no-copy interface.
- xgboost::HostDeviceVector<T> syncs automatically between host and device
- no-copy interfaces have been added
- default implementations just sync the data to host
and call the implementations with std::vector
- GPU objective function, predictor, histogram updater process data
directly on GPU
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* tiny fix for empty partition in predict
* further fix
* [jvm-packages] Prevent dispose being called twice when finalize
* Convert SIGSEGV to XGBoostError
* Avoid creating a new SBooster with the same JBooster
* Address CR Comments
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* fix the pattern in dev script and version mismatch
After installing ``gcc@5``, ``CMAKE_C_COMPILER`` will not automatically be set to gcc-5 in some macOS environments, and the installation of xgboost will still fail. Manually setting the compiler solves the problem.
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* add dev script to update version and update versions
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* update resource files
* Update SparkParallelismTracker.scala
* remove xgboost-tracker.properties
* [jvm-packages] Train Booster from an existing model
* Align Scala API with Java API
* Existing model should not load rabit checkpoint
* Address minor comments
* Implement saving temporary boosters and loading previous booster
* Add more unit tests for loadPrevBooster
* Add params to XGBoostEstimator
* (1) Move repartition out of the temp model saving loop (2) Address CR comments
* Catch a corner case of training next model with fewer rounds
* Address comments
* Refactor newly added methods into TmpBoosterManager
* Add two files which were missing in the previous commit
* Rename TmpBooster to checkpoint
* [jvm-packages] Fixed test/train persistence
Prior to this patch both data sets were persisted in the same directory,
i.e. the test data replaced the training one, which led to
* training on less data (since usually test < train) and
* test loss being exactly equal to the training loss.
Closes #2945.
* Cleanup file cache after the training
* Addressed review comments
* [R] fix finding R.exe with cmake on WIN when it is in PATH
* [R] appveyor config for R package
* [R] wrap the lines to make R check happier
* [R] install only binary dep-packages in appveyor
* [R] for MSVC appveyor, also build a binary for R package and keep as an artifact
* [R] fix predict contributions for data with no colnames
* [R] add a render parameter for xgb.plot.multi.trees; fixes #2628
* [R] update Rd's
* [R] remove unnecessary dep-package from R cmake install
* silence type warnings; readability
* [R] silence complaint about incomplete line at the end
* [R] initial version of xgb.plot.shap()
* [R] more work on xgb.plot.shap
* [R] enforce black font in xgb.plot.tree; fixes #2640
* [R] if feature names are available, check in predict that they are the same; fixes #2857
* [R] cran check and lint fixes
* remove tabs
* [R] add references; a test for plot.shap
* Fix #2905
* Fix gpu_exact test failures
* Fix bug in GPU prediction where multiple calls to batch prediction can produce incorrect results
* Fix GPU documentation formatting
* Some minor changes to the code style
Some minor changes to the code style in file basic_walkthrough.py
* coding style changes
* coding style changes according to PEP8
* Update basic_walkthrough.py
* Fix minor typo
* Minor edits to coding style
Minor edits to coding style following the proposals of PEP8.
* [jvm-packages] Exposed train-time evaluation metrics
They are accessible via 'XGBoostModel.summary'. The summary is not
serialized with the model and is only available after the training.
* Addressed review comments
* Extracted model-related tests into 'XGBoostModelSuite'
* Added tests for copying the 'XGBoostModel'
* [jvm-packages] Fixed a subtle bug in train/test split
Iterator.partition (naturally) assumes that the predicate is deterministic,
but this is not the case for
r.nextDouble() <= trainTestRatio
therefore the DMatrix(...) call sometimes got a NoSuchElementException
and crashed the JVM due to the lack of exception handling in
XGBoost4jCallbackDataIterNext.
* Make sure train/test objectives are different
I found the installation of the Python XGBoost package to be problematic as the documentation around compiler requirements was unclear, as discussed in #1501. I decided that I would improve the README.
- Implement colsampling, subsampling for gpu_hist_experimental
- Optimised multi-GPU implementation for gpu_hist_experimental
- Make nccl optional
- Add Volta architecture flag
- Optimise RegLossObj
- Add timing utilities for debug verbose mode
- Bump required cuda version to 8.0
In the refactor to add base margins, #2532, all of the labels were lost
when creating the dmatrix. This became obvious as metrics like ndcg
always returned 1.0 regardless of the results.
Change-Id: I88be047e1c108afba4784bd3d892bfc9edeabe55
Training a model with the experimental rank:ndcg objective incorrectly
returns a Classification model. Adjust the classification check to
not recognize rank:* objectives as classification.
Writing tests for isClassificationTask also turned up that
obj_type -> regression was incorrectly identified as a classification
task, so the function was slightly adjusted to pass the new tests.
* Some minor changes to the code style
Some minor changes to the code style in file basic_walkthrough.py
* coding style changes
* coding style changes according to PEP8
* Update basic_walkthrough.py
* Fatal error if GPU algorithm selected without GPU support compiled
* Resolve type conversion warnings
* Fix gpu unit test failure
* Fix compressed iterator edge case
* Fix python unit test failures due to flake8 update on pip
Problem:
Fast histogram updater crashes whenever subsampling picks zero rows
Diagnosis:
Row set data structure uses "nullptr" internally to indicate a non-existent
row set. Since you cannot take the address of the first element of an empty
vector, a valid row set ends up getting "nullptr" as well.
Fix:
Use an arbitrary value (not equal to "nullptr") to bypass nullptr check.
* Only set OpenMP_CXX_FLAGS when OpenMP is found
I found this while trying to get the Mac build working without OpenMP. Tips in
issue #2596 helped point me in the right direction.
* Revise check
* Trigger codecov
* Add SparkParallelismTracker to prevent job from hanging
* Code review comments
* Code Review Comments
* Fix unit tests
* Changes and unit test to catch the corner case.
* Update documentations
* Small improvements
* cancelAllJobs is problematic with scalatest. Remove it
* Code Review Comments
* Check number of executor cores beforehand, and throw an exception if any core is lost.
* Address CR Comments
* Add missing class
* Fix flaky unit test
* Address CR comments
* Remove redundant param for TaskFailedListener
* SHAP values for feature contributions
* Fix commenting error
* New polynomial time SHAP value estimation algorithm
* Update API to support SHAP values
* Fix merge conflicts with updates in master
* Correct submodule hashes
* Fix variable sized stack allocation
* Make lint happy
* Add docs
* Fix typo
* Adjust tolerances
* Remove unneeded def
* Fixed cpp test setup
* Updated R API and cleaned up
* Fixed test typo
* coding style update
The current coding style varies (for example: the mixed use of single quotes and double quotes), which can be confusing, especially for new users.
This PR tries to follow the proposals of PEP8 and make the documents more readable.
* minor fix
* Allowed subsampling test from the training data frame/RDD
The implementation requires storing a (1 - trainTestRatio) fraction of the points in memory
to make the sampling work.
An alternative approach would be to construct the full DMatrix and then
slice it deterministically into train/test. The peak memory consumption
of such scenario, however, is twice the dataset size.
* Removed duplication from 'XGBoost.train'
Scala callers can (and should) use names to supply a subset of
parameters. Method overloading is not required.
* Reuse XGBoost seed parameter to stabilize train/test splitting
* Added early stopping support to non-distributed XGBoost
Closes #1544
* Added early-stopping to distributed XGBoost
* Moved construction of 'watches' into a separate method
This commit also fixes the handling of 'baseMargin' which previously
was not added to the validation matrix.
* Addressed review comments
* [R] MSVC compatibility
* [GPU] allow seed in BernoulliRng up to size_t and scale to uint32_t
* R package build with cmake and CUDA
* R package CUDA build fixes and cleanups
* always export the R package native initialization routine on windows
* update the install instructions doc
* fix lint
* use static_cast directly to set BernoulliRng seed
* [R] demo for GPU accelerated algorithm
* tidy up the R package cmake stuff
* R pack cmake: installs main dependency packages if needed
* [R] version bump in DESCRIPTION
* update NEWS
* added short missing/sparse values explanations to FAQ
Current version of xgboost.readthedocs.io has a broken search box.
Enabling themes on ReadTheDocs is known to break the search function, as
reported in
[this document](https://github.com/rtfd/readthedocs.org/issues/1487). To get
around the bug, we replace the `searchtools.js` file with our custom version.
* Removal of redundant code/files.
* Removal of exact namespace in GPU plugin
* Revert double precision histograms to single precision for performance on Maxwell/Kepler
* Converted ml.dmlc.xgboost4j.LabeledPoint to Scala
This allows LabeledPoint to be easily integrated with Spark DataFrame APIs,
which support encoding/decoding case classes out of the box. An alternative
solution would be to keep LabeledPoint in Java and make it a Bean by
generating boilerplate getters/setters. I have decided against that, even
though the conversion in this PR implies a public API change.
I also had to remove the factory methods fromSparseVector and
fromDenseVector because a) they would need to be duplicated to support
overloaded calls with extra data (e.g. weight); and b) Scala would expose
them via mangled $.MODULE$ which looks ugly in Java.
Additionally, this commit makes it possible to switch to LabeledPoint in
all public APIs and effectively to pass initial margin/group as part of
the point. This seems to be the only reliable way of implementing distributed
learning with these data. Note that group size format used by single-node
XGBoost is not compatible with that scenario, since the partition split
could divide a group into two chunks.
* Switched to ml.dmlc.xgboost4j.LabeledPoint in RDD-based public APIs
Note that DataFrame-based and Flink APIs are not affected by this change.
* Removed baseMargin argument in favour of the LabeledPoint field
* Do a single pass over the partition in buildDistributedBoosters
Note that there is no formal guarantee that
val repartitioned = rdd.repartition(42)
repartitioned.zipPartitions(repartitioned.map(_ + 1)) { it1, it2, => ... }
would do a single shuffle, but in practice it seems to be always the case.
* Exposed baseMargin in DataFrame-based API
* Addressed review comments
* Pass baseMargin to XGBoost.trainWithDataFrame via params
* Reverted MLLabeledPoint in Spark APIs
As discussed, baseMargin would only be supported for DataFrame-based APIs.
* Cleaned up baseMargin tests
- Removed RDD-based test, since the option is no longer exposed via
public APIs
- Changed DataFrame-based one to check that adding a margin actually
affects the prediction
* Pleased Scalastyle
* Addressed more review comments
* Pleased scalastyle again
* Fixed XGBoost.fromBaseMarginsToArray
which always returned an array of NaNs even if base margin was not
specified. Surprisingly this only failed a few tests.
* repaired serialization after the update process; fixes #2545
* non-stratified folds in python could omit some data instances
* Makefile: fixes for older makes on windows; clean R-package too
* make cub to be a shallow submodule
* improve $(MAKE) recovery
* for MinGW, drop the 'lib' prefix from shared library name
* fix defines for 'g++ 4.8 or higher' to include g++ >= 5
* fix compile warnings
* [Appveyor] add MinGW with python; remove redundant jobs
* [Appveyor] also do python build for one of msvc jobs
* Deduplicated DataFrame creation in XGBoostDFSuite
* Extracted dermatology.data into MultiClassification
* Moved cache cleaning to SharedSparkContext
Cache files are prefixed with appName therefore this seems to be just the
place to delete them.
* Removed redundant JMatrix calls in xgboost4j-spark
* Slightly more readable buildDenseRDD in XGBoostGeneralSuite
* Generalized train/test DataFrame construction in XGBoostDFSuite
* Changed SharedSparkContext to setup a new context per-test
Hence the new name: PerTestSparkSession :)
* Fused Utils into PerTestSparkSession
* Whitespace fix in XGBoostDFSuite
* Ensure SparkSession is always eagerly created in PerTestSparkSession
* Renamed PerTestSparkSession->PerTest
because it was doing slightly more than creating/stopping the session.
Includes:
- Dockerfile changes
- Dockerfile clean up
- Fix execution privileges of files used from Dockerfile.
- New Dockerfile entrypoint to replace with_user script
- Defined placeholders for CPU testing (script and Dockerfile)
- Jenkinsfile
- Jenkins file milestone defined
- Single source code checkout and propagation via stash/unstash
- Bash needs to be explicitly used in launching the make build, since we need
access to the environment
- Jenkinsfile build factory for cmake and make style of jobs
- Archiving of artifacts (*.so, *.whl, *.egg) produced by the cmake build
Missing:
- CPU testing
- Python3 env build and testing
* [jvm-packages] Deduplicated train/test data access in tests
All datasets are now available via a unified API, e.g. Agaricus.test.
The only exception is the dermatology data which requires parsing a
CSV file.
* Inlined Utils.buildTrainingRDD
The default number of partitions for local mode is equal to the number
of available CPUs.
* Replaced dataset names with problem types
It has been reported that the new parallel algorithm (#2493) results in excessive
memory usage (see issue #2326). Until the issues are resolved, XGBoost should use
the old parallel algorithm by default. The user has to specify
`enable_feature_grouping=1` manually to enable the new algorithm.
* Patch to improve multithreaded performance scaling
Change parallel strategy for histogram construction.
Instead of partitioning data rows among multiple threads, partition feature
columns. Useful heuristics for assigning partitions have been adopted
from the LightGBM project.
* Add missing header to satisfy MSVC
* Restore max_bin and related parameters to TrainParam
* Fix lint error
* inline functions do not require static keyword
* Feature grouping algorithm accepting FastHistParam
Feature grouping algorithm accepts many parameters (3+), and it gets annoying to
pass them one by one. Instead, simply pass the reference to FastHistParam. The
definition of FastHistParam has been moved to a separate header file to
accommodate this change.
Prior to this commit XGBoostModel.predict produced an RDD with
an array of predictions for each partition, effectively changing
the shape wrt the input RDD. A more natural contract for prediction
API is that given an RDD it returns a new RDD with the same number
of elements. This allows the users to easily match inputs with
predictions.
This commit removes one layer of nesting in XGBoostModel.predict output.
Even though the change is clearly non-backward compatible, I still
think it is well justified. See discussion in 06bd5dca for motivation.
* Disabled excessive Spark logging in tests
* Fixed the signature of XGBoostModel.predict
Prior to this commit XGBoostModel.predict produced an RDD with
an array of predictions for each partition, effectively changing
the shape wrt the input RDD. A more natural contract for prediction
API is that given an RDD it returns a new RDD with the same number
of elements. This allows the users to easily match inputs with
predictions.
This commit removes one layer of nesting in XGBoostModel.predict output.
Even though the change is clearly non-backward compatible, I still
think it is well justified.
* Removed boxing in XGBoost.fromDenseToSparseLabeledPoints
* Inlined XGBoost.repartitionData
An if is more explicit than an opaque method name.
* Moved XGBoost.convertBoosterToXGBoostModel to XGBoostModel
* Check the input dimension in DMatrix.setBaseMargin
Prior to this commit providing an array of incorrect dimensions would
have resulted in memory corruption. Maybe backport this to C++?
* Reduced nesting in XGBoost.buildDistributedBoosters
* Ensured consistent naming of the params map
* Cleaned up DataBatch to make it easier to comprehend
* Made scalastyle happy
* Added baseMargin to XGBoost.train and trainWithRDD
* Deprecated XGBoost.train
It is ambiguous and works only for RDDs.
* Addressed review comments
* Revert "Fixed a singature of XGBoostModel.predict"
This reverts commit 06bd5dcae7780265dd57e93ed7d4135f4e78f9b4.
* Addressed more review comments
* Fixed NullPointerException in buildDistributedBoosters
* Fixed DLL name on Windows in ``xgboost.libpath``
* Added support for OS X to ``xgboost.libpath``
* Use .dylib for shared library on OS X
This does not affect the JNI library, because it is not truly
cross-platform in the Makefile build anyway.
* Exposed prediction feature contribution on the Java side
* was not supplying the newly added argument
* Exposed from Scala-side as well
* formatting (keep declaration in one line unless exceeding 100 chars)
* [jvm-packages] Ensure the native library is loaded once
Previously any class using XGBoostJNI queried NativeLibLoader to make
sure the native library is loaded. This commit moves the initXGBoost
call to XGBoostJNI, effectively delegating the initialization to the class
loader.
Note also that XGBoostJNI now does NOT suppress an IOException if it
occurs in initXGBoost.
* [jvm-packages] Fused JNIErrorHandle with XGBoostJNI
There was no reason for having a separate class.
When using xgboost4j-spark I had executors getting killed by YARN much more
often than I would expect for overrunning their memory limits,
based on the memoryOverhead provided. It looks like a significant
amount of this is because DMatrix objects were being created but not released,
because they were only released when the GC decided it was time to
clean up the references.
Rather than waiting for the GC, release the DMatrix objects when we know
they are no longer necessary.
* [jvm-packages] Fixed compilation on Windows
* [jvm-packages] Build the JNI bindings on Appveyor
* [jvm-packages] Build & test on OS X
* [jvm-packages] Re-applied the CMake build changes reverted by #2395
* Fixed Appveyor JVM build
* Muted Maven on Travis
* Don't link with libawt
* "linux2"->"linux"
Python 2.x and 3.x use slightly different values for ``sys.platform``.
* Support for building GPU plugins for specific GPU architectures
1. Option GPU_COMPUTE_VER exposed from both Makefile and CMakeLists.txt
2. updater_gpu documentation updated accordingly
* Re-introduced GPU_COMPUTE_VER option in the cmake flow.
This seems to fix the compile-time, rdc=true and copy-constructor related
errors seen and discussed in PR #2390.
* [jvm-packages] Fixed JNI_OnLoad overload
It does not compile on Windows without proper export flags.
* [jvm-packages] Use JNI types directly where appropriate
* Removed lib hack from CMake build
Prior to this commit the CMake build used a hardcoded lib prefix for
libxgboost and libxgboost4j. Unfortunately this did not play well with
Windows, which does not use the lib- prefix.
* [jvm-packages] Replaced create_jni.{bat,sh} with a Python version
This allows having a single script for all platforms.
* [jvm-packages] Added all configuration options to create_jni.py
Use int32_t explicitly when serializing the version field of DMatrix in binary
format. On ILP64 architectures, although very rare, the size of int is 64 bits.
* Integrating a faster version of grow_gpu plugin
1. Removed the older files to reduce duplication
2. Moved all of the grow_gpu files under 'exact' folder
3. All of them are inside 'exact' namespace to avoid any conflicts
4. Fixed a bug in benchmark.py while running only 'grow_gpu' plugin
5. Added cub and googletest submodules to ease integration and unit-testing
6. Updates to CMakeLists.txt to directly build cuda objects into libxgboost
* Added support for building gpu plugins through make flow
1. updated makefile and config.mk to add right targets
2. added unit-tests for gpu exact plugin code
* 1. Added support for building gpu plugin using 'make' flow as well
2. Updated instructions for building and testing gpu plugin
* Fix travis-ci errors for PR#2360
1. lint errors on unit-tests
2. removed googletest; instead depend on the gtest cache provided by dmlc-core
* Some more fixes to travis-ci lint failures PR#2360
* Added Rory's copyrights to the files containing code from both.
* updated copyright statement as per Rory's request
* moved the static datasets into a script to generate them at runtime
* 1. memory usage print when silent=0
2. tests/ and test/ folder organization
3. removal of the dependency of googletest for just building xgboost
4. coding style updates for .cuh as well
* Fixes for compilation warnings
* add cuda object files as well when JVM_BINDINGS=ON
* [jvm-packages] Added libxgboost4j to CMake build
* [jvm-packages] Wired CMake build into create_jni.sh
* Use newer CMake version on Travis
* Lowered CMake version constraints
* Fixed various quirks in the new CMake build
Don't use implicit conversions to c_int, which incidentally happen to work
on (some) 64-bit platforms, but:
* may lead to truncation of the input value to a 32-bit signed int,
* cause segfaults on some 32-bit architectures (tested on Ubuntu ARM,
but is also the likely cause of issue #1707).
Also, when passing references use explicit 64-bit integers, where needed,
instead of c_ulong, which is not guaranteed to be this large.
* Specified 'exec-maven-plugin' version
* Changed 'create_jni.sh' to fail on error
and also report each of the executed commands, which makes it easier
to debug.
The for loop in create.new.tree.features was referencing length(trees) as the upper bound of the loop, but trees is a base R dataset, not the model that the code is generating. Changed the loop boundary to model$niter, which should be the number of trees.
* Added kwargs support for Sklearn API
* Updated NEWS and CONTRIBUTORS
* Fixed CONTRIBUTORS.md
* Added clarification of **kwargs and test for proper usage
* Fixed lint error
* Fixed more lint errors and clf assigned but never used
* Fixed more lint errors
* Fixed more lint errors
* Fixed issue with changes from different branch bleeding over
* Fixed issue with changes from other branch bleeding over
* Added note that kwargs may not be compatible with Sklearn
* Fixed linting on kwargs note
* Added n_jobs and random_state to keep up to date with sklearn API.
Deprecated nthread and seed. Added tests for new params and
deprecations.
* Fixed docstring to reflect updates to n_jobs and random_state.
* Fixed whitespace issues and removed nose import.
* Added deprecation note for nthread and seed in docstring.
* Attempted fix of deprecation tests.
* Second attempted fix to tests.
* Set n_jobs to 1.
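A small sketch (data and values illustrative) of the renamed scikit-learn style parameters:
```python
import numpy as np
from xgboost import XGBClassifier

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

# n_jobs/random_state follow scikit-learn naming; nthread/seed still work
# but emit deprecation warnings.
clf = XGBClassifier(n_estimators=20, n_jobs=2, random_state=42)
clf.fit(X, y)
print(clf.predict(X[:5]))
```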
* [gblinear] add features contribution prediction; fix DumpModel bug
* [gbtree] minor changes to PredContrib
* [R] add feature contribution prediction to R
* [R] bump up version; update NEWS
* [gblinear] fix the base_margin issue; fixes #1969
* [R] list of matrices as output of multiclass feature contributions
* [gblinear] make order of DumpModel coefficients consistent: group index changes the fastest
* Fix compilation on OS X with GCC 7
Compilation failed with
In file included from src/tree/tree_updater.cc:6:0:
include/xgboost/tree_updater.h:75:46: error: 'function' is not a member of 'std'
std::function<TreeUpdater* ()> > {
caused by a missing <functional> include.
* Fixed another occurrence of that issue spotted by @ClimberPG
* Add option to choose booster in the scikit interface (gbtree by default)
* Add option to choose booster in the scikit interface: complete docstring.
* Fix XGBClassifier to work with booster option
* Added test case for gblinear booster
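A usage sketch (illustrative) of the new `booster` option in the scikit-learn interface:
```python
import numpy as np
from xgboost import XGBRegressor

X = np.random.rand(100, 3)
y = X @ np.array([1.0, -2.0, 0.5])

# 'gbtree' remains the default; 'gblinear' can now be chosen directly.
reg = XGBRegressor(booster='gblinear', n_estimators=50)
reg.fit(X, y)
```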
* [R] add native routines registration
* c_api.h needs to include <cstdint> since it uses fixed width integer types
* [R] use registered native routines from R code
* [R] bump version; add info on native routine registration to the contributors guide
* make lint happy
* Add prediction of feature contributions
This implements the idea described at http://blog.datadive.net/interpreting-random-forests/
which tries to give insight into how a prediction is composed of its feature contributions
and a bias.
* Support multi-class models
* Calculate learning_rate per-tree instead of using the one from the first tree
* Do not rely on node.base_weight * learning_rate having the same value as the node mean value (aka leaf value, if it were a leaf); instead calculate them (lazily) on-the-fly
* Add simple test for contributions feature
* Check against param.num_nodes instead of checking for non-zero length
* Loop over all roots instead of only the first
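A sketch (illustrative) of how the contributions can be queried from Python, assuming the flag is `pred_contribs`; per prediction, the per-feature contributions plus the trailing bias column sum to the margin output:
```python
import numpy as np
import xgboost

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)
dtrain = xgboost.DMatrix(X, label=y)
bst = xgboost.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=5)

contribs = bst.predict(dtrain, pred_contribs=True)   # shape: (n_samples, n_features + 1)
margin = bst.predict(dtrain, output_margin=True)
assert np.allclose(contribs.sum(axis=1), margin, atol=1e-5)
```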
* add back train method but mark as deprecated
* fix scalastyle error
* fix the persistence of XGBoostEstimator
* test persistence of a complete pipeline
* fix compilation issue
* do not allow persist custom_eval and custom_obj
* fix the failed test
* [R] make sure things work for a single split model; fixes #2191
* [R] add option use_int_id to xgb.model.dt.tree
* [R] add example of exporting tree plot to a file
* [R] set save_period = NULL as default in xgboost() to be the same as in xgb.train; fixes #2182
* [R] it's a good practice after CRAN releases to bump up package version in dev
* [R] allow xgb.DMatrix construction from integer dense matrices
* [R] xgb.DMatrix: silent parameter; improve documentation
* [R] xgb.model.dt.tree code style changes
* [R] update NEWS with parameter changes
* [R] code safety & style; handle non-strict matrix and inherited classes of input and model; fixes#2242
* [R] change to x.y.z.p R-package versioning scheme and set version to 0.6.4.3
* [R] add an R package versioning section to the contributors guide
* [R] R-package/README.md: clean up the redundant old installation instructions, link the contributors guide
Reported in issue #2165. Dynamic scheduling of OpenMP loops involves
implicit synchronization. To implement synchronization, libgomp uses futex
(fast userspace mutex), whereas MinGW uses kernel-space mutex, which is more
costly. With chunk size of 1, synchronization overhead may become prohibitive
on Windows machines.
Solution: use 'guided' schedule to minimize the number of syncs
Storing and then loading a model loses any eval_metric that was
provided. This causes implementations that always store/load, like
xgboost4j-spark, to be unable to eval with the desired metric.
This log appears to fire every time I ask the python package to make a prediction. It's the only log that fires from XGBoost. When we're getting predictions on millions of items a day in production, this log seems out of place.
* add back train method but mark as deprecated
* fix scalastyle error
* change class to object in examples
* fix compilation error
* small fix for cleanExternalCache
* add back train method but mark as deprecated
* fix scalastyle error
* change class to object in examples
* fix compilation error
* fix several issues in tests
* Bugfix 1: Fix segfault in multithreaded ApplySplitSparseData()
When there are more threads than rows in rowset, some threads end up
with empty ranges, causing them to crash. (iend - 1 needs to be
accessible as part of the algorithm.)
Fix: run only those threads with nonempty ranges.
* Add regression test for Bugfix 1
* Moving python_omp_test to existing python test group
It turns out you don't need to set "OMP_NUM_THREADS" to enable
multithreading. Just add the nthread parameter.
* Bugfix 2: Fix corner case of ApplySplitSparseData() for categorical feature
When split value is less than all cut points, split_cond is set
incorrectly.
Fix: set split_cond = -1 to indicate this scenario
* Bugfix 3: Initialize data layout indicator before using it
data_layout_ is accessed before being set; this variable determines
whether feature 0 is included in feat_set.
Fix: re-order code in InitData() to initialize data_layout_ first
* Adding regression test for Bugfix 2
Unfortunately, no regression test for Bugfix 3, as there is no
way to deterministically assign value to an uninitialized variable.
* Add UpdatePredictionCache() option to updaters
Some updaters (e.g. fast_hist) have enough information to quickly compute the
prediction cache for the training data. Each updater may override the
UpdatePredictionCache() method to update the prediction cache. Note: this
trick does not apply to validation data.
* Respond to code review
* Disable some debug messages by default
* Document UpdatePredictionCache() interface
* Remove base_margin logic from UpdatePredictionCache() implementation
* Do not take pointer to cfg, as reference may get stale
* Improve multi-threaded performance
* Use columnwise accessor to accelerate ApplySplit() step,
with support for a compressed representation
* Parallel sort for evaluation step
* Inline BuildHist() function
* Cache gradient pairs when building histograms in BuildHist()
* Add missing #if macro
* Respond to code review
* Use wrapper to enable parallel sort on Linux
* Fix C++ compatibility issues
* MSVC doesn't support unsigned in OpenMP loops
* gcc 4.6 doesn't support using keyword
* Fix lint issues
* Respond to code review
* Fix bug in ApplySplitSparseData()
* Attempting to read beyond the end of a sparse column
* Mishandling the case where an entire range of rows have missing values
* Fix training continuation bug
Disable UpdatePredictionCache() in the first iteration. This way, we can
accommodate the scenario where we build off of an existing (nonempty) ensemble.
* Add regression test for fast_hist
* Respond to code review
* Add back old version of ApplySplitSparseData
* Updated sklearn_parallel.py for soon-to-be-deprecated modules
* Updated predict_leaf_indices.py; use python3 print() as in other examples and removed an unused module
This commit proposes a simpler single compiler specification for OSX and *nix. It also lets people override the setting on both systems, not just *nix.
I used the online prediction function (`inline void Predict(const SparseBatch::Inst &inst, ... ) const;`), and the results obtained are different from the results of the batch prediction function (`virtual void Predict(DMatrix* data, ...) const = 0`). Investigation found that the online prediction function uses the `base_score_` parameter, while the batch prediction function does not use this parameter. It also turns out that the `base_score_` values are different when the same model file is loaded multiple times.
```
1st times:base_score_: 6.69023e-21
2nd times:base_score_: -3.7668e+19
3rd times:base_score_: 5.40507e+07
```
Online prediction results are affected by the `base_score_` parameter. After deleting the if condition (`if (out_preds->size() == 1)`), the online prediction is consistent with the batch prediction results, and the xgboost prediction results are consistent with the Python version. Therefore, it is likely that the online prediction function has a bug.
* [jvm-packages] call setGroup for ranking task
* passing groupData through xgBoostConfMap
* fix original comment position
* make groupData param
* remove groupData variable, use xgBoostConfMap directly
* set default groupData value
* add use groupData tests
* reduce rank-demo size
* use TaskContext.getPartitionId() instead of mapPartitionsWithIndex
* add DF use groupData test
* remove unused variable
* add back train method but mark as deprecated
* fix scalastyle error
* first commit in scala binding for fast histo
* java test
* add missed scala tests
* spark training
* add back train method but mark as deprecated
* fix scalastyle error
* local change
* first commit in scala binding for fast histo
* local change
* fix df frame test
* add back train method but mark as deprecated
* fix scalastyle error
* change class to object in examples
* fix compilation error
* bump spark version to 2.1
* preserve num_class issues
* fix failed test cases
* revising
* add multi class test
The verbose_eval docs claim it will log the last iteration (http://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train). This is also consistent with the behavior from 0.4. Not a huge deal, but I found it handy to see the last iteration's result because my period is usually large.
This doesn't address logging the last stage found by early_stopping (as noted in the docs), as I'm not sure how to do that.
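For context, a small sketch of the period-based logging in question (data and period illustrative):
```python
import numpy as np
import xgboost

X = np.random.rand(500, 5)
y = np.random.randint(0, 2, size=500)
dtrain = xgboost.DMatrix(X[:400], label=y[:400])
dtest = xgboost.DMatrix(X[400:], label=y[400:])

# With an integer period, evaluation is printed every 50 rounds; the point of
# this change is that the final round's result is printed as well.
xgboost.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=200,
              evals=[(dtest, 'test')], verbose_eval=50)
```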
* [jvm-packages] Scala implementation of the Rabit tracker.
A Scala implementation of RabitTracker that is interface-interchangeable with the
Java implementation, ported from `tracker.py` in the
[dmlc-core project](https://github.com/dmlc/dmlc-core).
* [jvm-packages] Updated Akka dependency in pom.xml.
* Refactored the RabitTracker directory structure.
* Fixed premature stopping of connection handler.
Added a new finite state "AwaitingPortNumber" to explicitly wait for the
worker to send the port, and close the connection. Stopping the actor
prematurely sends a TCP RST to the worker, causing the worker to crash
on AssertionError.
* Added interface IRabitTracker so that user can switch implementations.
* Default timeout duration changes.
* Dependency for Akka tests.
* Removed the main function of RabitTracker.
* A skeleton for testing Akka-based Rabit tracker.
* waitFor() in RabitTracker no longer throws exceptions.
* Completed unit test for the 'start' command of Rabit tracker.
* Preliminary support for Rabit Allreduce via JNI (no prepare function support yet.)
* Fixed the default timeout duration.
* Use Java container to avoid serialization issues due to intermediate wrappers.
* Added tests for Allreduce/model training using Scala Rabit tracker.
* Added spill-over unit test for the Scala Rabit tracker.
* Fixed a typo.
* Overhaul of RabitTracker interface per code review.
- Removed methods start() waitFor() (no arguments) from IRabitTracker.
- The timeout in start(timeout) is now worker connection timeout, as tcp
socket binding timeout is less intuitive.
- Dropped time unit from start(...) and waitFor(...) methods; the default
time unit is millisecond.
- Moved random port number generation into the RabitTrackerHandler.
- Moved all Rabit-related classes to package ml.dmlc.xgboost4j.scala.rabit.
* More code refactoring and comments.
* Unified timeout constants. Readable tracker status code.
* Add comments to indicate that allReduce is for tests only. Removed all other variants.
* Removed unused imports.
* Simplified signatures of training methods.
- Moved TrackerConf into parameter map.
- Changed GeneralParams so that TrackerConf becomes a standalone parameter.
- Updated test cases accordingly.
* Changed monitoring strategies.
* Reverted monitoring changes.
* Update test case for Rabit AllReduce.
* Mix in UncaughtExceptionHandler into IRabitTracker to prevent tracker from hanging due to exceptions thrown by workers.
* More comprehensive test cases for exception handling and worker connection timeout.
* Handle executor loss due to unknown cause: the newly spawned executor will attempt to connect to the tracker. Interrupt tracker in such case.
* Per code-review, removed training timeout from TrackerConf. Timeout logic must be implemented explicitly and externally in the driver code.
* Reverted scalastyle-config changes.
* Visibility scope change. Interface tweaks.
* Use match pattern to handle tracker_conf parameter.
* Minor clarification in JNI code.
* Clearer intent in match pattern to suppress warnings.
* Removed Future from constructor. Block in start() and waitFor() instead.
* Revert inadvertent comment changes.
* Removed debugging information.
* Updated test cases that are a bit finicky.
* Added comments on the reasoning behind the unit tests for testing Rabit tracker robustness.
* Fixed BufferUnderFlow bug in decoding tracker 'print' command.
* Merge conflicts resolution.
* A fix regarding the compatibility with python 2.6
the syntax of {n: self.attr(n) for n in attr_names} is illegal in python 2.6
* Update core.py
add a space after comma
As discussed in issue #1978, tree_method=hist ignores the parameter
param.num_roots; it simply assumes that the tree has only one root. In
particular, when InitData() method initializes row_set_collection_, it simply
assigns all rows to node 0, the value that's hard-coded.
For now, the updater will simply fail when num_roots exceeds 1. I will revise
the updater soon to support multiple roots.
* [R] xgb.save must work when handle is nil but raw exists
* [R] print.xgb.Booster should still print other info when handle is nil
* [R] rename internal function xgb.Booster to xgb.Booster.handle to make its intent clear
* [R] rename xgb.Booster.check to xgb.Booster.complete and make it visible; more docs
* [R] storing evaluation_log should depend only on watchlist, not on verbose
* [R] reduce the excessive chattiness of unit tests
* [R] only disable some tests in windows when it's not 64-bit
* [R] clean-up xgb.DMatrix
* [R] test xgb.DMatrix loading from libsvm text file
* [R] store feature_names in xgb.Booster, use them from utility functions
* [R] remove non-functional co-occurence computation from xgb.importance
* [R] verbose=0 is enough without a callback
* [R] added forgotten xgb.Booster.complete.Rd; cran check fixes
* [R] update installation instructions
* added the max_features parameter to the plot_importance function.
* renamed max_features parameter to max_num_features for better understanding
* removed unwanted character in docstring
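A minimal sketch of the renamed parameter in use; the data and booster below are placeholders, and only the `max_num_features` argument is the point of the example.
```
import numpy as np
import xgboost as xgb
import matplotlib.pyplot as plt

# Placeholder data and a quick booster, just to have something to plot
X = np.random.rand(200, 30)
y = np.random.randint(2, size=200)
dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=10)

# Draw only the 10 highest-ranked features instead of all of them
xgb.plot_importance(bst, max_num_features=10)
plt.show()
```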
* Support histogram-based algorithm + multiple tree growing strategy
* Add a brand new updater to support a histogram-based algorithm, which buckets
continuous features into discrete bins to speed up training. To use it, set
`tree_method = fast_hist` in the configuration.
* Support multiple tree growing strategies. For now, two policies are supported:
* `grow_policy=depthwise` (default): favor splitting at nodes closest to the
root, i.e. grow depth-wise.
* `grow_policy=lossguide`: favor splitting at nodes with highest loss change
* Improve single-threaded performance
* Unroll critical loops
* Introduce specialized code for dense data (i.e. no missing values)
* Additional training parameters: `max_leaves`, `max_bin`, `grow_policy`, `verbose`
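A rough configuration sketch for the new updater from the Python side, on synthetic data; the merged parameter value is `hist` (this change initially refers to it as `fast_hist`).
```
import numpy as np
import xgboost as xgb

X = np.random.rand(1000, 50)
y = np.random.randint(2, size=1000)
dtrain = xgb.DMatrix(X, label=y)

params = {
    'objective': 'binary:logistic',
    'tree_method': 'hist',       # histogram-based updater (called fast_hist in this change)
    'grow_policy': 'lossguide',  # split the node with the highest loss change first
    'max_leaves': 63,            # only meaningful with grow_policy=lossguide
    'max_bin': 256,              # number of discrete bins per feature
}
bst = xgb.train(params, dtrain, num_boost_round=20)
```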
* Adding a small test for hist method
* Fix memory error in row_set.h
When std::vector is resized, a reference to one of its elements may become
stale. Any such reference must be updated as well.
* Resolve cross-platform compilation issues
* Versions of g++ older than 4.8 lack support for a few C++11 features, e.g.
alignas(*) and the new initializer syntax. To support g++ 4.6, use pre-C++11
initializers and remove alignas(*).
* Versions of MSVC older than 2015 do not support alignas(*). To support
MSVC 2012, remove alignas(*).
* For g++ 4.8 and newer, alignas(*) is enabled for performance benefits.
* Some old compilers (MSVC 2012, g++ 4.6) do not support template aliases
(which use `using` to declare type aliases), so always use `typedef`.
* Fix a host of CI issues
* Remove dependency for libz on osx
* Fix heading for hist_util
* Fix minor style issues
* Add missing #include
* Remove extraneous logging
* Enable tree_method=hist in R
* Renaming HistMaker to GHistBuilder to avoid confusion
* Fix R integration
* Respond to style comments
* Consistent tie-breaking for priority queue using timestamps
* Last-minute style fixes
* Fix issuecomment-271977647
The way we quantize data is broken. The agaricus data consists of all
categorical values. When NAs are converted into 0's,
`HistCutMatrix::Init` assigns both 0's and 1's to the same single bin.
Why? gmat contains only the smallest value (0) and an upper bound (2), which is twice
the maximum value (1). Adding the maximum value itself to gmat fixes the issue.
* Fix issuecomment-272266358
* Remove padding from cut values for the continuous case
* For categorical/ordinal values, use midpoints as bin boundaries to be safe
* Fix CI issue -- do not use xrange(*)
* Fix corner case in quantile sketch
Signed-off-by: Philip Cho <chohyu01@cs.washington.edu>
* Adding a test for an edge case in quantile sketcher
max_bin=2 used to cause an exception.
* Fix fast_hist test
The test used to require a strictly increasing Test AUC for all examples.
One of them exhibits a small blip in Test AUC before achieving a Test AUC
of 1. (See bottom.)
Solution: do not require monotonic increase for this particular example.
[0] train-auc:0.99989 test-auc:0.999497
[1] train-auc:1 test-auc:0.999749
[2] train-auc:1 test-auc:0.999749
[3] train-auc:1 test-auc:0.999749
[4] train-auc:1 test-auc:0.999749
[5] train-auc:1 test-auc:0.999497
[6] train-auc:1 test-auc:1
[7] train-auc:1 test-auc:1
[8] train-auc:1 test-auc:1
[9] train-auc:1 test-auc:1
* option to shuffle data in mknfolds
* removed possibility to run as stand alone test
* split function def in 2 lines for lint
* option to shuffle data in mknfolds
* removed possibility to run as stand alone test
* split function def in 2 lines for lint
* fix cran check
* change required R version because of utils::globalVariables
* temporary commit, monotone not working
* fix test
* fix doc
* fix doc
* fix cran note and warning
* improve checks
* fix urls
* fix cran check
* add cleanup and bump up version number
* use clean in build
* Update Makefile
* [R-package] JSON tree dump interface
* [R-package] precision bugfix in xgb.attributes
* [R-package] bugfix for cb.early.stop called from xgb.cv
* [R-package] a bit more clarity on labels checking in xgb.cv
* [R-package] test JSON dump for gblinear as well
* whitespace lint
* [jvm-packages] Scala implementation of the Rabit tracker.
A Scala implementation of RabitTracker that is interface-interchangeable with the
Java implementation, ported from `tracker.py` in the
[dmlc-core project](https://github.com/dmlc/dmlc-core).
* [jvm-packages] Updated Akka dependency in pom.xml.
* Refactored the RabitTracker directory structure.
* Fixed premature stopping of connection handler.
Added a new finite state "AwaitingPortNumber" to explicitly wait for the
worker to send the port, and close the connection. Stopping the actor
prematurely sends a TCP RST to the worker, causing the worker to crash
on AssertionError.
* Added interface IRabitTracker so that user can switch implementations.
* Default timeout duration changes.
* Dependency for Akka tests.
* Removed the main function of RabitTracker.
* A skeleton for testing Akka-based Rabit tracker.
* waitFor() in RabitTracker no longer throws exceptions.
* Completed unit test for the 'start' command of Rabit tracker.
* Preliminary support for Rabit Allreduce via JNI (no prepare function support yet.)
* Fixed the default timeout duration.
* Use Java container to avoid serialization issues due to intermediate wrappers.
* Added tests for Allreduce/model training using Scala Rabit tracker.
* Added spill-over unit test for the Scala Rabit tracker.
* Fixed a typo.
* Overhaul of RabitTracker interface per code review.
- Removed methods start() waitFor() (no arguments) from IRabitTracker.
- The timeout in start(timeout) is now worker connection timeout, as tcp
socket binding timeout is less intuitive.
- Dropped time unit from start(...) and waitFor(...) methods; the default
time unit is millisecond.
- Moved random port number generation into the RabitTrackerHandler.
- Moved all Rabit-related classes to package ml.dmlc.xgboost4j.scala.rabit.
* More code refactoring and comments.
* Unified timeout constants. Readable tracker status code.
* Add comments to indicate that allReduce is for tests only. Removed all other variants.
* Removed unused imports.
* Simplified signatures of training methods.
- Moved TrackerConf into parameter map.
- Changed GeneralParams so that TrackerConf becomes a standalone parameter.
- Updated test cases accordingly.
* Changed monitoring strategies.
* Reverted monitoring changes.
* Update test case for Rabit AllReduce.
* Mix in UncaughtExceptionHandler into IRabitTracker to prevent tracker from hanging due to exceptions thrown by workers.
* More comprehensive test cases for exception handling and worker connection timeout.
* Handle executor loss due to unknown cause: the newly spawned executor will attempt to connect to the tracker. Interrupt tracker in such case.
* Per code-review, removed training timeout from TrackerConf. Timeout logic must be implemented explicitly and externally in the driver code.
* Reverted scalastyle-config changes.
* Visibility scope change. Interface tweaks.
* Use match pattern to handle tracker_conf parameter.
* Minor clarification in JNI code.
* Clearer intent in match pattern to suppress warnings.
* Removed Future from constructor. Block in start() and waitFor() instead.
* Revert inadvertent comment changes.
* Removed debugging information.
* Updated test cases that are a bit finicky.
* Added comments on the reasoning behind the unit tests for testing Rabit tracker robustness.
GetWeight() is a wrapper which sets the correct weight
if the weights vector is not provided; hence accessing the default
weights vector directly is not recommended.
Update the code coverage of the project on Codecov for easy viewing.
Also, the gcov on Travis is a different version which cannot
find the directory of the given files on its own, so it needs to be specified
via the -o flag. Hence we now loop over the list of files and
run gcov on them independently.
* [CORE] allow updating trees in an existing model
* [CORE] in refresh updater, allow keeping old leaf values and update stats only
* [R-package] xgb.train mod to allow updating trees in an existing model
* [R-package] added check for nrounds when is_update
* [CORE] merge parameter declaration changes; unify their code style
* [CORE] move the update-process trees initialization to Configure; rename default process_type to 'default'; fix the trees and trees_to_update sizes comparison check
* [R-package] unit tests for the update process type
* [DOC] documentation for the process_type parameter; improved docs for updater, Gamma and Tweedie; added some parameter aliases; fixed metrics indentation and documented a few metrics that were missing
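A hedged Python sketch of the update process described in the items above: refresh an existing model's node statistics on new data without growing new trees. The data, objective and round counts are placeholders; `num_boost_round` is kept equal to the number of trees already in the model.
```
import numpy as np
import xgboost as xgb

X = np.random.rand(500, 10)
y = np.random.rand(500)
dtrain = xgb.DMatrix(X, label=y)

# Train an initial model with 10 trees
bst = xgb.train({'objective': 'reg:linear'}, dtrain, num_boost_round=10)

# Refresh the existing trees' statistics on new data; with refresh_leaf=False
# the old leaf values are kept and only the node stats are updated
refresh_params = {
    'process_type': 'update',
    'updater': 'refresh',
    'refresh_leaf': False,
}
dnew = xgb.DMatrix(np.random.rand(500, 10), label=np.random.rand(500))
bst_refreshed = xgb.train(refresh_params, dnew, num_boost_round=10, xgb_model=bst)
```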
* fix my sloppy merge conflict resolutions
* [CORE] add a TreeProcessType enum
* whitespace fix
* fix cran check
* change required R version because of utils::globalVariables
* temporary commit, monotone not working
* fix test
* fix doc
* fix doc
* fix cran note and warning
* improve checks
* fix urls
* Fix various typos
* Add override to functions that are overridden
gcc gives warnings about functions that override a virtual function but are
not marked as overridden. This fixes those warnings.
* Use bst_float consistently
Use bst_float for all the variables that involve weight,
leaf value, gradient, hessian, gain, loss_chg, predictions,
base_margin, feature values.
In some cases, where accumulation (additions and so on) can produce a larger
value, double is used instead.
This ensures that type conversions are minimal and reduces loss of
precision.
* Allow using learning_rates parameter when doing CV
- Create a new `callback_cv` method working when called from `xgb.cv()`
- Rename existing `callback` into `callback_train` and make it the default callback
- Get the logic out of the callbacks and place it into a common helper
* Add a learning_rates parameter to cv()
* lint
* remove caller explicit reference
* callback is aware of its calling context
* remove caller argument
* remove learning_rates param
* restore learning_rates for training, but deprecated
* lint
* lint line too long
* quick example for predefined callbacks
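As a rough example of the predefined callbacks mentioned just above, a per-round learning-rate schedule can be passed to `xgb.cv()` instead of the deprecated `learning_rates` argument; the callback name `reset_learning_rate` is taken from the Python callback module of that era, and the data is synthetic.
```
import numpy as np
import xgboost as xgb

X = np.random.rand(300, 10)
y = np.random.randint(2, size=300)
dtrain = xgb.DMatrix(X, label=y)
params = {'objective': 'binary:logistic', 'eta': 0.3}

# One learning rate per boosting round: start fast, then slow down
schedule = [0.3] * 5 + [0.1] * 5
res = xgb.cv(params, dtrain, num_boost_round=10, nfold=3,
             callbacks=[xgb.callback.reset_learning_rate(schedule)])
```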
In ecb3a271be the silent argument
in XGDMatrixCreateFromFile of c_api.cc was always overridden to
be false. This disabled the functionality to hide log messages.
This commit reverts that part to enable the hiding of log messages.
* add back train method but mark as deprecated
* fix scalastyle error
* change class to object in examples
* fix compilation error
* update methods in test cases to be consistent
* add blank lines
* fix
On Unix systems, it's common for programs to read their input from stdin, and
write their output to stdout. Messages should be written to stderr, where they
won't corrupt a program's output, and where they can be seen by the user even
if the output is being redirected.
This is mostly a problem when XGBoost is being used from Python or from another
program.
* add support for tweedie regression
* added back readme line that was accidentally deleted
* fixed linting errors
* add support for tweedie regression
* added back readme line that was accidentally deleted
* fixed linting errors
* rebased with upstream master and added R example
* changed parameter name to tweedie_variance_power
* linting error fix
* refactored tweedie-nloglik metric to be more like the other parameterized metrics
* added upper and lower bound check to tweedie metric
* add support for tweedie regression
* added back readme line that was accidentally deleted
* fixed linting errors
* added upper and lower bound check to tweedie metric
* added back readme line that was accidentally deleted
* rebased with upstream master and added R example
* rebased again on top of upstream master
* linting error fix
* added upper and lower bound check to tweedie metric
* rebased with master
* lint fix
* removed whitespace at end of line 186 - elementwise_metric.cc
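A minimal sketch of the new objective from the Python side; the synthetic gamma-distributed target is just a stand-in for non-negative, right-skewed data such as insurance claims.
```
import numpy as np
import xgboost as xgb

X = np.random.rand(500, 8)
y = np.random.gamma(shape=2.0, scale=1.0, size=500)  # non-negative, skewed target
dtrain = xgb.DMatrix(X, label=y)

params = {
    'objective': 'reg:tweedie',
    'tweedie_variance_power': 1.5,         # in (1, 2): 1 -> Poisson, 2 -> gamma
    'eval_metric': 'tweedie-nloglik@1.5',  # parameterized metric added in this change
}
bst = xgb.train(params, dtrain, num_boost_round=20)
```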
* Fix typos and messages in docs
* parameter.md: Add docs for updater_seq
Mention the updater_seq parameter which sets the order of the tree
updaters to run and also specifies which ones to run. This can be
useful when pruning is not required or even a custom plugin is
being built along with xgboost.
* Add format to the params accepted by DumpModel
Currently, only the text format is supported when trying to dump
a model. The plan is to add more formats, such as JSON, which are
easy to read and/or parse by machines, and to make the interface
generic enough to allow other formats to be added.
Hence, we make some modifications so that these functions become generic
and accept a new parameter "format" which specifies the format of
the dump to be created.
* Fix typos and errors in docs
* plugin: Mention all the register macros available
Document the register macros currently available to the plugin
writers so they know what exactly can be extended using hooks.
* sparse_page_source: Use same arg name in .h and .cc
* gbm: Add JSON dump
The dump_format argument can be used to specify what type
of dump file should be created. Add functionality to dump
gblinear and gbtree into a JSON file.
The JSON file contains an array; each item is a JSON object for one tree.
For gblinear:
- The item is the bias and weights vectors
For gbtree:
- The item is the root node. The root node has an attribute "children"
which holds the children nodes. This happens recursively.
* core.py: Add arg dump_format for get_dump()
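A short sketch of the new argument from the Python side; everything except `dump_format` is placeholder setup.
```
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 4)
y = np.random.randint(2, size=100)
dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=2)

text_dump = bst.get_dump(dump_format='text')  # the original text dump
json_dump = bst.get_dump(dump_format='json')  # one JSON object per tree
print(json_dump[0])  # root node with a nested "children" array for gbtree
```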
The `cd ..;` in the one liner takes you up a directory instead of into the xgboost directory. This will cause that step of the installation to fail. It seems like you are meant to enter the xgboost directory as you did in the instructions for installing xgboost without openmp.
* add back train method but mark as deprecated
* fix scalastyle error
* change class to object in examples
* fix compilation error
* fix mis configuration
* make DMatrix._init_from_npy2d only copy data when necessary
When creating a DMatrix from a 2d ndarray, it can unnecessarily copy the input data. This can be problematic when the data is already very large, risking running out of memory. The copy is temporary (going out of scope at the end of this function) but it still adds to peak memory usage.
``numpy.array`` copies its input no matter what by default. By adding ``copy=False``, it will only do so when necessary. Since XGDMatrixCreateFromMat is read-only on the input buffer, this copy is not needed.
Also added comments explaining when a copy can happen (if the data ordering/layout is wrong or if the type is not 32-bit float).
* remove whitespace
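A small illustration of the copy semantics being relied on here; `np.shares_memory` is only used to show that no copy happens when the input is already C-contiguous 32-bit float.
```
import numpy as np

mat = np.random.rand(1000, 10).astype(np.float32)      # C-contiguous float32 input

flat = mat.reshape(mat.size)                           # a view, no copy
always = np.array(flat)                                # default: always copies
maybe = np.array(flat, copy=False, dtype=np.float32)   # copies only if dtype/layout must change

print(np.shares_memory(mat, always))  # False
print(np.shares_memory(mat, maybe))   # True: the original buffer is reused
```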
* correct CalcDCG in rank_metric.cc
DCG uses log base-2; however, `std::log` returns log base-e.
* correct CalcDCG in rank_obj.cc
DCG uses log base-2; however, `std::log` returns log base-e.
* use std::log2 instead of std::log
make it more elegant
* use std::log2 instead of std::log
make it more elegant
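For reference, a small NumPy sketch of DCG with base-2 position discounts, which is what the fix switches to; the exponential-gain form shown here is an assumption about the exact variant used in rank_metric.cc.
```
import numpy as np

def dcg_at_k(relevance, k):
    """DCG with base-2 logarithmic position discounts."""
    rel = np.asarray(relevance, dtype=float)[:k]
    positions = np.arange(1, rel.size + 1)
    return np.sum((2.0 ** rel - 1.0) / np.log2(positions + 1))

print(dcg_at_k([3, 2, 3, 0, 1, 2], k=6))
```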
* Fix #1439
* Fix python wrapper: when an eval set name contains '-', the early_stop maximize variable can't be set to True properly
Change-Id: Ib0595afd4ae7b445a84c00a3a8faeccc506c6d13
* Changes for Mingw64 compilation to ensure long is a consistent size.
Mainly impacts the Java API which would not compile, but there may be
silent errors on Windows with large datasets before this patch (as long
is 32-bits when compiled with mingw64 even in 64-bit mode).
* Adding ifdefs to ensure it still compiles on MacOS
* Makefile and create_jni.bat changes for Windows.
* Switching XGDMatrixCreateFromCSREx JNI call to use size_t cast
* Fixing lint error, adding profile switching to jvm-packages build to make create-jni.bat get called, adding myself to Contributors.Md
add_library(libxgboost SHARED ${SOURCES}) builds a library named
liblibxgboost.so; however, simply changing it to add_library(xgboost ...)
won't work, as add_executable(xgboost ...) and add_library(xgboost ...)
would then have the same target name. This patch correctly handles the
same-name situation through SET_TARGET_PROPERTIES.
* add scikit-learn v0.18 compatibility
import KFold & StratifiedKFold from sklearn.model_selection instead of sklearn.cross_validation
* change DeprecationWarning to ImportError
DeprecationWarning isn't an exception, so it should work the other way around.
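A minimal sketch of the compatibility shim, assuming the fallback-import approach: try the scikit-learn 0.18 module first and fall back to the deprecated one, catching ImportError rather than DeprecationWarning.
```
try:
    # scikit-learn >= 0.18
    from sklearn.model_selection import KFold, StratifiedKFold
except ImportError:
    # older scikit-learn versions
    from sklearn.cross_validation import KFold, StratifiedKFold
```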
ml.dmlc.xgboost4j.scala.spark.XGBoost.scala:51
values is empty when it is first encountered, so values(0) throws an IndexOutOfBoundsException.
It should be dVector.values(i) instead of values(i).
* bump up to scala 2.11
* framework of data frame integration
* test consistency between RDD and DataFrame
* order preservation
* test order preservation
* example code and fix makefile
* improve type checking
* improve APIs
* user docs
* work around travis CI's limitation on log length
* adjust test structure
* integrate with Spark 1.x
* spark 2.x integration
* remove spark 1.x implementation but provide instructions on how to downgrade
* [TREE] Experimental version of monotone constraint
* Allow default detection of the monotone option
* loosen the condition of the strict check
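A hedged sketch of what the experimental constraint looks like from the Python side; the string-tuple form of `monotone_constraints` is the parameter syntax, and the data is synthetic.
```
import numpy as np
import xgboost as xgb

X = np.random.rand(400, 3)
y = X[:, 0] - X[:, 2] + 0.1 * np.random.rand(400)
dtrain = xgb.DMatrix(X, label=y)

params = {
    'objective': 'reg:linear',
    # increasing in feature 0, unconstrained in feature 1, decreasing in feature 2
    'monotone_constraints': '(1,0,-1)',
}
bst = xgb.train(params, dtrain, num_boost_round=20)
```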
* Update gbtree.cc
Fixed to work with future versions of Visual Studio, i.e. 2015.
MSVC has its own section for setting compile parameters; it shouldn't need to fall into the section below that checks for C++11, as this is definitely already supported. Though this isn't an issue for Visual Studio 2012, it breaks for later versions
of Visual Studio, i.e. 2015, where the default C++ standard is C++14 (which is still backward compatible with C++11).
* test consistency of prediction functions between DMatrix and RDD
* remove APIs with DMatrix from xgboost-spark
* fix compilation error in xgboost4j-example
* fix test cases
Currently xgboost can only be installed by running:
python setup.py install
Now it can be packaged (in binary form) as a wheel and installed like:
pip install xgboost-0.6-py2-none-any.whl
distutils and wheel install `data_files` differently than setuptools.
setuptools will install the `data_files` in the package directory whereas the
others install it in `sys.prefix`. By adding `sys.prefix` to the list of
directories to check for the shared library, xgboost can now be distributed as
a wheel.
* Fixed OpenMP installation on MacOSX with gcc-6
- Modified makefile from gcc-5 to gcc-6
- Removed deprecated install instructions from doc (gcc-5 was automatically forced if available in makefile on OSX)
* Fixed OpenMP installation on MacOSX with gcc-6
- Modified makefile from gcc-5 to gcc-6
- Removed deprecated install instructions from doc (gcc-5 was automatically forced if available in makefile on OSX)
Make the math better; specifically, unify the notation for Theta/theta. Changed the basic linear model notation from weight w to theta for consistency, and changed the objective function notation as well.
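A sketch of the unified notation, assuming the model-introduction page's running example: the linear model and the regularized objective are both written in terms of theta rather than w.
```
\hat{y}_i = \sum_{j} \theta_j x_{ij},
\qquad
\text{obj}(\Theta) = L(\Theta) + \Omega(\Theta)
                   = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \Omega(\Theta)
```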
* force gcc-5 or clang-omp for Mac OS, prepare for pip pack
* add sklearn dep, make -j4
* finalize PyPI submission
* revert to Xcode clang for passing build #1468
* force to clang, try to solve cmake travis error
* remove sklearn dependency
* [R] do not remove zero coefficients from gblinear dump
* [R] switch from stringr to stringi
* fix #1399
* [R] separate ggplot backend, add base r graphics, cleanup, more plots, tests
* add missing include in amalgamation - fixes building R package in linux
* add forgotten file
* [R] fix DESCRIPTION
* [R] fix travis check issue and some cleanup
* Add deviance metric for gamma regression
* Simplify the computation of nloglik for gamma regression
* Add a description for gamma-deviance
* Minor fix
* Add support for Gamma regression
* Use base_score to replace the lp_bias
* Remove the lp_bias config block
* Add a demo for running gamma regression in Python
* Typo fix
* Revise the description for objective
* Add a script to generate the autoclaims dataset
* create dmatrix with specified missing value
* update dmlc-core
* support for predict method in spark package
repartitioning workaround
* add more elements to work around training set empty partition issue
* added new function to calculate other feature importances
* added capability to plot other feature importance measures
* changed plotting default to fscore
* added info on importance_type to boilerplate comment
* updated text of error statement
* added self module name to fix call
* added unit test for feature importances
* style fixes
This error message can be hard to understand when there are several fields, as shown in the example below. This improves the error message, letting the user know which fields were unexpected or missing.
import xgboost as xgb
import pandas as pd
train = pd.DataFrame({'a':[1], 'b':[2], 'c':[3], 'd':[4], 'f':[2], 'g':2, 'etc etc etc':[11]})
dtrain = xgb.DMatrix(train.drop('d', axis=1), train.d)
test = pd.DataFrame({'a':[1], 'b':[2], 'c':[1], 'd':[4], 'e':[2], 'f':[2], 'g':2, 'etc etc etc':[11]})
dtest = xgb.DMatrix(test)
modl = xgb.train({}, dtrain)
modl.predict(dtest)
# ValueError: feature_names mismatch: [u'a', u'b', u'c', u'etc etc etc', u'f', u'g'] [u'a', u'b', u'c', u'd', u'e', u'etc etc etc', u'f', u'g']
- fixed a bug when both the eval_metric xgb param and the
metrics param of the cv function have been set
- cv early stopping output now looks like that of xgb.train
The link for line 26 was wrong; it pointed again to the last demo. I was reading the readme and found the subtle inconsistency. Please accept this minor change. It works correctly now.
- ensures same behavior for verbose_eval=0 and verbose_eval=False
- fix printing last eval message if early_stopping_rounds is set, but xgb
runs to the end
bed6320 Merge pull request #26 from DrAndrey/master
291ab05 Remove redundant whitespace again
de25163 Remove redundant whitespace
3a6be65 Fix bug with name of sleep function
git-subtree-dir: subtree/rabit
git-subtree-split: bed63208af
- best_ntree_limit added as a new booster attribute
- usage of bst.best_ntree_limit added to the Python doc
- fixed wrong 'best_iteration' after training continuation
- allows feval to return a list of tuples (name, error/score value)
- changed behavior for multiple eval_metrics in conjunction with
early_stopping: Instead of raising an error, the last passed eval_metric
(or last entry in the return value of feval) is used for early stopping
- allows list of eval_metrics in dict-typed params
- unittest for new features / behavior
documentation updated
- example for assigning a list to 'eval_metric'
- note about early stopping on last passed eval metric
- info msg for used eval metric added
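A rough sketch of the behavior described above, on a toy binary problem: `feval` returns a list of (name, value) tuples, early stopping keys on the last one, and `best_ntree_limit` is read back from the booster afterwards.
```
import numpy as np
import xgboost as xgb

X = np.random.rand(300, 5)
y = np.random.randint(2, size=300)
dtrain = xgb.DMatrix(X[:200], label=y[:200])
dtest = xgb.DMatrix(X[200:], label=y[200:])

def multi_feval(preds, dmat):
    labels = dmat.get_label()
    err = float(np.sum((preds > 0.5) != labels)) / len(labels)
    # early stopping uses the last tuple in this list
    return [('my-error', err), ('my-error-x2', 2 * err)]

bst = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=50,
                evals=[(dtest, 'eval')], feval=multi_feval,
                early_stopping_rounds=5)
print(bst.best_iteration, bst.best_ntree_limit)
```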
Squashed commits:
[9109887] Added test for eta decay(+1 squashed commit)
Squashed commits:
[1336bd4] Added tests for eta decay (+2 squashed commit)
Squashed commits:
[91aac2d] Added tests for eta decay (+1 squashed commit)
Squashed commits:
[3ff48e7] Added test for eta decay
[6bb1eed] Rewrote Rd files
[bf0dec4] Added learning_rates for diff eta in each boosting round
- Pandas DataFrame supports more dtypes than 'int64', 'float64' and 'bool'; therefore a bunch of extra dtypes were added for the data variable.
- From now on the label variable can be a Pandas DataFrame with the same dtypes as the data variable.
- If label is a Pandas DataFrame, it will be converted to float.
- If no feature_types is set, the data dtypes will be converted to 'int' or 'float'.
- The feature_names may contain every character except [, ] or <
Changed the name of eval_results to evals_result, so that the naming is the same in training.py and sklearn.py
Made the structure of evals_result the same as in training.py, the names of the keys are different:
In sklearn.py you cannot name your evals_result, but they are automatically called 'validation_0', 'validation_1' etc.
The dict evals_result will output something like: {'validation_0': {'logloss': ['0.674800', '0.657121']}, 'validation_1': {'logloss': ['0.63776', '0.58372']}}
In training.py you can name your multiple evals_result with a watchlist like: watchlist = [(dtest,'eval'), (dtrain,'train')]
The dict evals_result will output something like: {'train': {'logloss': ['0.68495', '0.67691']}, 'eval': {'logloss': ['0.684877', '0.676767']}}
You can access the evals_result using the evals_result() function.
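A short sketch of the sklearn-side naming described above; the eval sets here are deliberately trivial and only exist to show the automatically generated 'validation_0' / 'validation_1' keys.
```
import numpy as np
from xgboost import XGBClassifier

X = np.random.rand(200, 5)
y = np.random.randint(2, size=200)

clf = XGBClassifier(n_estimators=3)
clf.fit(X, y, eval_metric='logloss', eval_set=[(X, y), (X, y)])

# Keys are assigned automatically: 'validation_0', 'validation_1', ...
print(clf.evals_result())
```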
Made changes to training.py to make sure all eval_metric information gets passed to evals_result. The previous version lost and mislabeled data in evals_result when using more than one eval_metric.
Structure of eval_metric is now:
eval_metric[evals][eval_metric] = list of metrics
Example:
>>> dtrain = xgb.DMatrix('agaricus.txt.train', silent=True)
>>> dtest = xgb.DMatrix('agaricus.txt.test', silent=True)
>>> param = [('max_depth', 2), ('objective', 'binary:logistic'), ('bst:eta', 0.01), ('eval_metric', 'logloss'), ('eval_metric', 'error')]
>>> watchlist = [(dtest,'eval'), (dtrain,'train')]
>>> num_round = 3
>>> evals_result = {}
>>> bst = xgb.train(param, dtrain, num_round, watchlist, evals_result=evals_result)
>>> print(evals_result['eval']['logloss'])
>>> print(evals_result)
Prints:
['0.684877', '0.676767', '0.668817']
{'train': {'logloss': ['0.684954', '0.676917', '0.669036'], 'error': ['0.04652', '0.04652', '0.04652']}, 'eval': {'logloss': ['0.684877', '0.676767', '0.668817'], 'error': ['0.042831', '0.042831', '0.042831']}}
Currently `pip install xgboost` will raise a traceback like this
```
Traceback (most recent call last):
File "<string>", line 20, in <module>
File "/tmp/pip-build-IAdqYE/xgboost/setup.py", line 20, in <module>
import xgboost
File "./xgboost/__init__.py", line 8, in <module>
from .core import DMatrix, Booster
File "./xgboost/core.py", line 12, in <module>
import numpy as np
ImportError: No module named numpy
```
We should avoid importing numpy in setup.py and let pip install numpy and scipy automatically.
That's what `install_requires` is for.
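A hedged sketch of the idea (not the project's actual setup.py): declare numpy and scipy in `install_requires` so pip resolves them, instead of importing the package at build time.
```
from setuptools import setup

setup(
    name='xgboost',
    version='0.6',
    packages=['xgboost'],
    # pip installs these before/with the package, so setup.py never has to import them
    install_requires=['numpy', 'scipy'],
)
```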
Issue in "demo(package="xgboost", custom_objective)"
> bst <- xgb.train(param, dtrain, num_round, watchlist,
+ objective=logregobj, eval_metric=evalerror)
Error in xgb.train(param, dtrain, num_round, watchlist, objective = logregobj, :
Duplicated term in parameters. Please check your list of params.
Somewhat more robust and clearer logic in stratified CV to guess classification/regression settings. Allows accommodating custom objectives (classification is assumed when the number of unique values in the labels is <= 5).
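A tiny sketch of that heuristic, with the threshold of 5 unique label values taken from the description above; the helper name is made up for illustration.
```
import numpy as np

def looks_like_classification(labels, max_unique=5):
    """Guess classification vs. regression from the label column."""
    return np.unique(labels).size <= max_unique

print(looks_like_classification([0, 1, 1, 0]))          # True -> stratify
print(looks_like_classification(np.random.rand(100)))   # False -> plain k-fold
```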
d4ec037 fix rabit
6612fcf Merge branch 'master' of ssh://github.com/tqchen/rabit
d29892c add mock option statis
4fa054e new tracker
75c647c update tracker for host IP
e4ce8ef add hadoop linear example
76ecb4a add hadoop linear example
2e1c4c9 add hadoop linear example
git-subtree-dir: subtree/rabit
git-subtree-split: d4ec037f2e
- DLL import now works when __file__ is a relative path
- Various PEP8 and whitespace fixes + whitespace cleanup
- Docstring fixes (conform to numpydoc)
- Added __all__ to the module
- Fixed mutable default values
- Removed print statements
- py2/py3-compatible string-type checks
- Replace asserts with proper exceptions
- Make classes new-style (derive from object)
85b7463 change def of reducer to take function ptr
fe6366e add engine base
a98720e more deps
git-subtree-dir: subtree/rabit
git-subtree-split: 85b746394e
Thanks for participating in the XGBoost community! We use https://discuss.xgboost.ai for any general usage questions and discussions. The issue tracker is used for actionable items such as feature proposals discussion, roadmaps, and bug tracking. You are always welcomed to post on the forum first :)
Issues that are inactive for a period of time may get closed. We adopt this policy so that we won't lose track of actionable issues that may fall at the bottom of the pile. Feel free to reopen a new one if you feel there is an additional problem that needs attention when an old one gets closed.
For bug reports, to help the developer act on the issues, please include a description of your environment, preferably a minimum script to reproduce the problem.
For feature proposals, list clear, small actionable items so we can track the progress of the change.
XGBoost has been developed and used by a group of active community members. Everyone is more than welcome to contribute; it is a great way to make the project better and more accessible to more users.
Committers
----------
Committers are people who have made substantial contribution to the project and granted write access to the project.
* [Tianqi Chen](https://github.com/tqchen), University of Washington
- Tianqi is a PhD student working on large-scale machine learning; he is the creator of the project.
* [Tong He](https://github.com/hetong007), Amazon AI
- Tong is an applied scientist at Amazon AI; he is the maintainer of the xgboost R package.
- Michael is a lawyer and data scientist in France; he is the creator of the xgboost interactive analysis module in R.
* [Yuan Tang](https://github.com/terrytangyuan)
- Yuan is a data scientist in Chicago, US. He contributed mostly to the R and Python packages.
* [Nan Zhu](https://github.com/CodingCat)
- Nan is a software engineer at Microsoft. He contributed mostly to the JVM packages.
* [Sergei Lebedev](https://github.com/superbobry)
- Sergei is a software engineer at Criteo. He contributed mostly to the JVM packages.
Become a Committer
------------------
XGBoost is an open-source project and we are actively looking for new committers who are willing to help maintain and lead the project.
Committers come from contributors who:
* Made substantial contributions to the project.
* Are willing to spend time maintaining and leading the project.
New committers will be proposed by current committers, with support from more than two of the current committers.
List of Contributors
--------------------
* [Full List of Contributors](https://github.com/dmlc/xgboost/graphs/contributors)
- To contributors: please add your name to the list when you submit a patch to the project:)
* [Kailong Chen](https://github.com/kalenhaha)
- Kailong is an early contributor of xgboost; he is the creator of the ranking objectives in xgboost.
* [Skipper Seabold](https://github.com/jseabold)
- Skipper is the major contributor to the scikit-learn module of xgboost.
* [Zygmunt Zając](https://github.com/zygmuntz)
- Zygmunt is the master behind the early stopping feature frequently used by kagglers.
* [Ajinkya Kale](https://github.com/ajkl)
* [Boliang Chen](https://github.com/cblsjtu)
* [Yangqing Men](https://github.com/yanqingmen)
- Yangqing is the creator of xgboost java package.
* [Engpeng Yao](https://github.com/yepyao)
* [Giulio](https://github.com/giuliohome)
- Giulio is the creator of windows project of xgboost
* [Jamie Hall](https://github.com/nerdcha)
- Jamie is the initial creator of xgboost sklearn module.
* [Yen-Ying Lee](https://github.com/white1033)
* [Masaaki Horikoshi](https://github.com/sinhrks)
- Masaaki is the initial creator of xgboost python plotting module.
* [Hongliang Liu](https://github.com/phunterlau)
* [Hyunsu Cho](http://hyunsu-cho.io/)
- Hyunsu is the maintainer of the XGBoost Python package. He is in charge of submitting the Python package to Python Package Index (PyPI). He is also the initial author of the CPU 'hist' updater.
* [daiyl0320](https://github.com/daiyl0320)
- daiyl0320 contributed a patch to make the xgboost distributed version more robust and scale stably on TB-scale datasets.
This file records the changes in xgboost library in reverse chronological order.
## v0.80 (2018.08.13)
* **JVM packages received a major upgrade**: To consolidate the APIs and improve the user experience, we refactored the design of XGBoost4J-Spark in a significant manner. (#3387)
- Consolidated APIs: It is now much easier to integrate XGBoost models into a Spark ML pipeline. Users can control behaviors like output leaf prediction results by setting corresponding column names. Training is now more consistent with other Estimators in Spark MLLIB: there is now one single method `fit()` to train decision trees.
- Better user experience: we refactored the parameter-related modules in XGBoost4J-Spark to provide both camel-case (Spark ML style) and underscore (XGBoost style) parameters
- A brand-new tutorial is [available](https://xgboost.readthedocs.io/en/release_0.80/jvm/xgboost4j_spark_tutorial.html) for XGBoost4J-Spark.
- Latest API documentation is now hosted at https://xgboost.readthedocs.io/.
* XGBoost documentation now keeps track of multiple versions
* New features
- Query ID column support in LIBSVM data files (#2749). This is convenient for performing ranking tasks in a distributed setting.
- Hinge loss for binary classification (`binary:hinge`) (#3477)
- Ability to specify delimiter and instance weight column for CSV files (#3546)
- Ability to use 1-based indexing instead of 0-based (#3546)
* GPU support
- Quantile sketch, binning, and index compression are now performed on GPU, eliminating PCIe transfer for 'gpu_hist' algorithm (#3319, #3393)
- Upgrade to NCCL2 for multi-GPU training (#3404).
- Use shared memory atomics for faster training (#3384).
- Dynamically allocate GPU memory, to prevent large allocations for deep trees (#3519)
- Fix memory copy bug for large files (#3472)
* Python package
- Importing data from Python datatable (#3272)
- Pre-built binary wheels available for 64-bit Linux and Windows (#3424, #3443)
- Add new importance measures 'total_gain', 'total_cover' (#3498)
- Sklearn API now supports saving and loading models (#3192)
- Arbitrary cross validation fold indices (#3353)
- `predict()` function in Sklearn API uses `best_ntree_limit` if available, to make early stopping easier to use (#3445)
- Informational messages are now directed to Python's `print()` rather than standard output (#3438). This way, messages appear inside Jupyter notebooks.
* R package
- Oracle Solaris support, per CRAN policy (#3372)
* JVM packages
- Single-instance prediction (#3464)
- Pre-built JARs are now available from Maven Central (#3401)
- Add NULL pointer check (#3021)
- Consider `spark.task.cpus` when controlling parallelism (#3530)
- Handle missing values in prediction (#3529)
- Eliminate outputs of `System.out` (#3572)
* Refactored C++ DMatrix class for simplicity and de-duplication (#3301)
* Refactored C++ histogram facilities (#3564)
* Refactored constraints / regularization mechanism for split finding (#3335, #3429). Users may specify an elastic net (L2 + L1 regularization) on leaf weights as well as monotonic constraints on test nodes. The refactor will be useful for a future addition of feature interaction constraints.
* Statically link `libstdc++` for MinGW32 (#3430)
* Enable loading from `group`, `base_margin` and `weight` (see [here](http://xgboost.readthedocs.io/en/latest/tutorials/input_format.html#auxiliary-files-for-additional-information)) for Python, R, and JVM packages (#3431)
* Fix model saving for `count:poisson` so that `max_delta_step` doesn't get truncated (#3515)
* Fix loading of sparse CSC matrix (#3553)
* Fix incorrect handling of `base_score` parameter for Tweedie regression (#3295)
## v0.72.1 (2018.07.08)
This version is only applicable for the Python package. The content is identical to that of v0.72.
## v0.72 (2018.06.01)
* Starting with this release, we plan to make a new release every two months. See #3252 for more details.
* Fix a pathological behavior (near-zero second-order gradients) in multiclass objective (#3304)
* Tree dumps now use high precision in storing floating-point values (#3298)
* Submodules `rabit` and `dmlc-core` have been brought up to date, bringing bug fixes (#3330, #3221).
* GPU support
- Continuous integration tests for GPU code (#3294, #3309)
- Abstract 1D vector class now works with multiple GPUs (#3287)
- Generate PTX code for most recent architecture (#3316)
- Fix a memory bug on NVIDIA K80 cards (#3293)
- Address performance instability for single-GPU, multi-core machines (#3324)
* Python package
- FreeBSD support (#3247)
- Validation of feature names in `Booster.predict()` is now optional (#3323)
* Updated Sklearn API
- Validation sets now support instance weights (#2354)
- `XGBClassifier.predict_proba()` should not support `output_margin` option. (#3343) See BREAKING CHANGES below.
* R package:
- Better handling of NULL in `print.xgb.Booster()` (#3338)
- Comply with CRAN policy by removing compiler warning suppression (#3329)
- Updated CRAN submission
* JVM packages
- JVM packages will now use the same versioning scheme as other packages (#3253)
- Update Spark to 2.3 (#3254)
- Add scripts to cross-build and deploy artifacts (#3276, #3307)
- Fix a compilation error for Scala 2.10 (#3332)
* BREAKING CHANGES
- `XGBClassifier.predict_proba()` no longer accepts parameter `output_margin`. The parameter makes no sense for `predict_proba()` because the method is meant to predict class probabilities, not raw margin scores.
## v0.71 (2018.04.11)
* This is a minor release, mainly motivated by issues concerning `pip install`, e.g. #2426, #3189, #3118, and #3194.
With this release, users of Linux and MacOS will be able to run `pip install` for the most part.
* Refactored linear booster class (`gblinear`), so as to support multiple coordinate descent updaters (#3103, #3134). See BREAKING CHANGES below.
* Fix slow training for multiclass classification with high number of classes (#3109)
* Fix a corner case in approximate quantile sketch (#3167). Applicable for 'hist' and 'gpu_hist' algorithms
* Fix memory leak in DMatrix (#3182)
* New functionality
- Better linear booster class (#3103, #3134)
- Pairwise SHAP interaction effects (#3043)
- Cox loss (#3043)
- AUC-PR metric for ranking task (#3172)
- Monotonic constraints for 'hist' algorithm (#3085)
* GPU support
- Create an abstract 1D vector class that moves data seamlessly between the main and GPU memory (#2935, #3116, #3068). This eliminates unnecessary PCIe data transfer during training time.
- Compatibility fixes for latest Spark versions (#3062, #3093)
* BREAKING CHANGES: Updated linear modelling algorithms. In particular L1/L2 regularisation penalties are now normalised to number of training examples. This makes the implementation consistent with sklearn/glmnet. L2 regularisation has also been removed from the intercept. To produce linear models with the old regularisation behaviour, the alpha/lambda regularisation parameters can be manually scaled by dividing them by the number of training examples.
## v0.7 (2017.12.30)
* **This version represents a major change from the last release (v0.6), which was released one year and half ago.**
* Updated Sklearn API
- Add compatibility layer for scikit-learn v0.18: `sklearn.cross_validation` now deprecated
- Updated to allow use of all XGBoost parameters via `**kwargs`.
- Updated `nthread` to `n_jobs` and `seed` to `random_state` (as per Sklearn convention); `nthread` and `seed` are now marked as deprecated
- Updated to allow choice of Booster (`gbtree`, `gblinear`, or `dart`)
- `XGBRegressor` now supports instance weights (specify `sample_weight` parameter)
- Pass `n_jobs` parameter to the `DMatrix` constructor
- Add `xgb_model` parameter to `fit` method, to allow continuation of training
* Refactored gbm to allow more friendly cache strategy
- Specialized some prediction routine
* Robust `DMatrix` construction from a sparse matrix
* Faster construction of `DMatrix` from 2D NumPy matrices: elide copies, use multiple threads
* Automatically remove nan from input data when it is sparse.
- This can solve some user-reported problems of istart != hist.size
* Fix the single-instance prediction function to obtain correct predictions
* Minor fixes
- Thread local variable is upgraded so it is automatically freed at thread exit.
- Fix saving and loading `count::poisson` models
- Fix CalcDCG to use base-2 logarithm
- Messages are now written to stderr instead of stdout
- Keep built-in evaluations while using customized evaluation functions
- Use `bst_float` consistently to minimize type conversion
- Copy the base margin when slicing `DMatrix`
- Evaluation metrics are now saved to the model file
- Use `int32_t` explicitly when serializing version
- In distributed training, synchronize the number of features after loading a data matrix.
* Migrate to C++11
- The current master version now requires a C++11-enabled compiler (g++ 4.8 or higher)
* Predictor interface was factored out (in a manner similar to the updater interface).
* Makefile support for Solaris and ARM
* Test code coverage using Codecov
* Add CPP tests
* Add `Dockerfile` and `Jenkinsfile` to support continuous integration for GPU code
* New functionality
- Ability to adjust tree model's statistics to a new dataset without changing tree structures.
- Ability to extract feature contributions from individual predictions, as described in [here](http://blog.datadive.net/interpreting-random-forests/) and [here](https://arxiv.org/abs/1706.06060).
- Faster, histogram-based tree algorithm (`tree_method='hist'`) .
- GPU/CUDA accelerated tree algorithms (`tree_method='gpu_hist'` or `'gpu_exact'`), including the GPU-based predictor.
- Monotonic constraints: when other features are fixed, force the prediction to be monotonic increasing with respect to a certain specified feature.
- Faster gradient calculation using AVX SIMD
- Ability to export models in JSON format
- Support for Tweedie regression
- Additional dropout options for DART: binomial+1, epsilon
- Ability to update an existing model in-place: this is useful for many applications, such as determining feature importance
* Python package:
- New parameters:
- `learning_rates` in `cv()`
- `shuffle` in `mknfold()`
- `max_features` and `show_values` in `plot_importance()`
- `sample_weight` in `XGBRegressor.fit()`
- Support binary wheel builds
- Fix `MultiIndex` detection to support Pandas 0.21.0 and higher
- Support metrics and evaluation sets whose names contain `-`
- Support feature maps when plotting trees
- Compatibility fix for Python 2.6
- Call `print_evaluation` callback at last iteration
- Use appropriate integer types when calling native code, to prevent truncation and memory error
- Fix shared library loading on Mac OS X
* R package:
- New parameters:
- `silent` in `xgb.DMatrix()`
- `use_int_id` in `xgb.model.dt.tree()`
- `predcontrib` in `predict()`
- `monotone_constraints` in `xgb.train()`
- Default value of the `save_period` parameter in `xgboost()` changed to NULL (consistent with `xgb.train()`).
- It's possible to custom-build the R package with GPU acceleration support.
- Enable JVM build for Mac OS X and Windows
- Integration with AppVeyor CI
- Improved safety for garbage collection
- Store numeric attributes with higher precision
- Easier installation for devel version
- Improved `xgb.plot.tree()`
- Various minor fixes to improve user experience and robustness
- Register native code to pass CRAN check
- Updated CRAN submission
* JVM packages
- Add Spark pipeline persistence API
- Fix data persistence: loss evaluation on test data had wrongly used caches for training data.
- Clean external cache after training
- Implement early stopping
- Enable training of multiple models by distinguishing stage IDs
- Better Spark integration: support RDD / dataframe / dataset, integrate with Spark ML package
- XGBoost4j now supports ranking task
- Support training with missing data
- Refactor JVM package to separate regression and classification models to be consistent with other machine learning libraries
- Support XGBoost4j compilation on Windows
- Parameter tuning tool
- Publish source code for XGBoost4j to maven local repo
- Scala implementation of the Rabit tracker (drop-in replacement for the Java implementation)
- Better exception handling for the Rabit tracker
- Persist `num_class`, number of classes (for classification task)
- `XGBoostModel` now holds `BoosterParams`
- libxgboost4j is now part of CMake build
- Release `DMatrix` when no longer needed, to conserve memory
- Expose `baseMargin`, to allow initialization of boosting with predictions from an external model
- Support instance weights
- Use `SparkParallelismTracker` to prevent jobs from hanging forever
- Expose train-time evaluation metrics via `XGBoostModel.summary`
- Option to specify `host-ip` explicitly in the Rabit tracker
* Documentation
- Better math notation for gradient boosting
- Updated build instructions for Mac OS X
- Template for GitHub issues
- Add `CITATION` file for citing XGBoost in scientific writing
- Fix dropdown menu in xgboost.readthedocs.io
- Document `updater_seq` parameter
- Style fixes for Python documentation
- Links to additional examples and tutorials
- Clarify installation requirements
* Changes that break backward compatibility
- [#1519](https://github.com/dmlc/xgboost/pull/1519) XGBoost-spark no longer contains APIs for DMatrix; use the public booster interface instead.
- [#2476](https://github.com/dmlc/xgboost/pull/2476) `XGBoostModel.predict()` now has a different signature
## v0.6 (2016.07.29)
* Version 0.5 is skipped due to major improvements in the core
* Major refactor of core library.
- Goal: more flexible and modular code as a portable library.
- Switch to use of c++11 standard code.
- Random number generator defaults to ```std::mt19937```.
- Share the data loading pipeline and logging module from dmlc-core.
- Enable registry pattern to allow optionally plugin of objective, metric, tree constructor, data loader.
- Future plugin modules can be put into xgboost/plugin and register back to the library.
- Remove most of the raw pointers to smart ptrs, for RAII safety.
* Add official option to approximate algorithm `tree_method` to parameter.
- Change the default behavior to prefer the faster algorithm.
- The user will get a message when the approximate algorithm is chosen.
* Change library name to libxgboost.so
* Backward compatibility
- The binary buffer file is not backward compatible with previous versions.
- The model file is backward compatible on 64-bit platforms.
* The model file is compatible between 64/32-bit platforms (not yet tested).
* External memory version and other advanced features will be exposed to the R library as well on Linux.
- Previously some of the features were blocked due to C++11 and threading limits.
- The Windows version is still blocked because Rtools does not support ```std::thread```.
* rabit and dmlc-core are maintained through git submodule
- Anyone can open PR to update these dependencies now.
* Improvements
- Rabit and xgboost libs are now thread-safe and use thread-local PRNGs
- This could fix some of the previous problems when running xgboost on multiple threads.
* JVM Package
- Enable xgboost4j for java and scala
- XGBoost distributed now runs on Flink and Spark.
* Support model attributes listing for meta data.
- https://github.com/dmlc/xgboost/pull/1198
- https://github.com/dmlc/xgboost/pull/1166
* Support callback API
- https://github.com/dmlc/xgboost/issues/892
- https://github.com/dmlc/xgboost/pull/1211
- https://github.com/dmlc/xgboost/pull/1264
* Support new booster DART(dropout in tree boosting)
- https://github.com/dmlc/xgboost/pull/1220
* Add CMake build system
- https://github.com/dmlc/xgboost/pull/1314
## v0.47 (2016.01.14)
* Changes in R library
- fixed possible problem of poisson regression.
- switched from 0 to NA for missing values.
- exposed access to additional model parameters.
* Changes in Python library
- throws exception instead of crash terminal when a parameter error happens.
- has importance plot and tree plot functions.
- accepts different learning rates for each boosting round.
- allows model training continuation from previously saved model.
- allows early stopping in CV.
- allows feval to return a list of tuples.
- allows eval_metric to handle additional format.
- improved compatibility in sklearn module.
- additional parameters added for sklearn wrapper.
- added pip installation functionality.
- supports more Pandas DataFrame dtypes.
- added best_ntree_limit attribute, in addition to best_score and best_iteration.
* Java api is ready for use
* Added more test cases and continuous integration to make each build more robust.
## v0.4 (2015.05.11)
* Distributed version of xgboost that runs on YARN, scales to billions of examples
* Direct save/load data and model from/to S3 and HDFS
* Feature importance visualization in R module, by Michael Benesty
* Predict leaf index
* Poisson regression for counts data
* Early stopping option in training
* Native save load support in R and python
- xgboost models now can be saved using save/load in R
- xgboost python model is now pickable
* sklearn wrapper is supported in python module
* Experimental External memory version
## v0.3 (2014.09.07)
* Faster tree construction module
- Allows subsampling columns during tree construction via ```bst:col_samplebytree=ratio```
* Support for boosting from initial predictions
* Experimental version of LambdaRank
* Linear booster is now parallelized, using parallel coordinate descent.
* Add [Code Guide](src/README.md) for customizing objective function and evaluation
#' \code{list(metric='metric-name', value='metric-value')} with given
#' prediction and dtrain,
#' \code{list(metric='metric-name', value='metric-value')} with given
#' prediction and dtrain.
#' @param stratified a \code{boolean} indicating whether sampling of folds should be stratified
#' by the values of outcome labels.
#' @param folds \code{list} provides a possibility to use a list of pre-defined CV folds
#' (each element must be a vector of test fold's indices). When folds are supplied,
#' the \code{nfold} and \code{stratified} parameters are ignored.
#' @param verbose \code{boolean}, print the statistics during the process
#' @param print_every_n Print each n-th iteration evaluation messages when \code{verbose>0}.
#' Default is 1 which means all messages are printed. This parameter is passed to the
#' \code{\link{cb.print.evaluation}} callback.
#' @param early_stopping_rounds If \code{NULL}, the early stopping function is not triggered.
#' If set to an integer \code{k}, training with a validation set will stop if the performance
#' doesn't improve for \code{k} rounds.
#' Setting this parameter engages the \code{\link{cb.early.stop}} callback.
#' @param maximize If \code{feval} and \code{early_stopping_rounds} are set,
#' then this parameter must be set as well.
#' When it is \code{TRUE}, it means the larger the evaluation score the better.
#' This parameter is passed to the \code{\link{cb.early.stop}} callback.
#' @param callbacks a list of callback functions to perform various task during boosting.
#' See \code{\link{callbacks}}. Some of the callbacks are automatically created depending on the
#' parameters' values. User can provide either existing or their own callback methods in order
#' to customize the training process.
#' @param ... other parameters to pass to \code{params}.
#'
#' @details
#' This is the cross validation function for xgboost
#'
#' Parallelization is automatically enabled if OpenMP is present.
#' Number of threads can also be manually specified via "nthread" parameter.
#' The original sample is randomly partitioned into \code{nfold} equal size subsamples.
#'
#' This function only accepts an \code{xgb.DMatrix} object as the input.
#' Of the \code{nfold} subsamples, a single subsample is retained as the validation data for testing the model, and the remaining \code{nfold - 1} subsamples are used as training data.
#'
#' The cross-validation process is then repeated \code{nrounds} times, with each of the \code{nfold} subsamples used exactly once as the validation data.
#'
#' All observations are used for both training and validation.
#'
#' Adapted from \url{http://en.wikipedia.org/wiki/Cross-validation_\%28statistics\%29#k-fold_cross-validation}
#'
#' @return
#' An object of class \code{xgb.cv.synchronous} with the following elements:
#' \itemize{
#' \item \code{call} a function call.
#' \item \code{params} parameters that were passed to the xgboost library. Note that it does not
#' capture parameters changed by the \code{\link{cb.reset.parameters}} callback.
#' \item \code{callbacks} callback functions that were either automatically assigned or
#' explicitly passed.
#' \item \code{evaluation_log} evaluation history stored as a \code{data.table} with the
#' first column corresponding to iteration number and the rest corresponding to the
#' CV-based evaluation means and standard deviations for the training and test CV-sets.
#' It is created by the \code{\link{cb.evaluation.log}} callback.
#' \item \code{niter} number of boosting iterations.
#' \item \code{nfeatures} number of features in training data.
#' \item \code{folds} the list of CV folds' indices - either those passed through the \code{folds}
#' parameter or randomly generated.
#' \item \code{best_iteration} iteration number with the best evaluation metric value
#' (only available with early stopping).
#' \item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the best iteration,
#' which could further be used in \code{predict} method
#' (only available with early stopping).
#' \item \code{pred} CV prediction values available when \code{prediction} is set.
#' It is either vector or matrix (see \code{\link{cb.cv.predict}}).
#' \item \code{models} a list of the CV folds' models. It is only available with the explicit
#' setting of the \code{cb.cv.predict(save_models = TRUE)} callback.
#' \code{xgb.train} is an advanced interface for training an xgboost model.
#' The \code{xgboost} function is a simpler wrapper for \code{xgb.train}.
#'
#' @param params the list of parameters.
#' The complete list of parameters is available at \url{http://xgboost.readthedocs.io/en/latest/parameter.html}.
#' Below is a shorter summary:
#'
#' 1. General Parameters
#'
#' \itemize{
#' \item \code{objective} objective function, common ones are
#' \itemize{
#' \item \code{reg:linear} linear regression
#' \item \code{binary:logistic} logistic regression for classification
#' }
#' \item \code{eta} step size of each boosting step
#' \item \code{max.depth} maximum depth of the tree
#' \item \code{nthread} number of threads used in training, if not set, all threads are used
#' \item \code{booster} which booster to use, can be \code{gbtree} or \code{gblinear}. Default: \code{gbtree}.
#' }
#'
#' See \url{https://github.com/tqchen/xgboost/wiki/Parameters} for
#' further details. See also demo/ for walkthrough example in R.
#' @param data takes an \code{xgb.DMatrix} as the input.
#' @param nrounds the max number of iterations
#' @param watchlist what information should be printed when \code{verbose=1} or
#' \code{verbose=2}. Watchlist is used to specify validation set monitoring
#' during training. For example user can specify
#' watchlist=list(validation1=mat1, validation2=mat2) to watch
#' the performance of each round's model on mat1 and mat2
#'
#'
#' 2. Booster Parameters
#'
#' 2.1. Parameter for Tree Booster
#'
#' \itemize{
#' \item \code{eta} control the learning rate: scale the contribution of each tree by a factor of \code{0 < eta < 1} when it is added to the current approximation. Used to prevent overfitting by making the boosting process more conservative. Lower value for \code{eta} implies larger value for \code{nrounds}: low \code{eta} value means model more robust to overfitting but slower to compute. Default: 0.3
#' \item \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree. the larger, the more conservative the algorithm will be.
#' \item \code{max_depth} maximum depth of a tree. Default: 6
#' \item \code{min_child_weight} minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
#' \item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. It makes computation shorter (because less data to analyse). It is advised to use this parameter with \code{eta} and increase \code{nround}. Default: 1
#' \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
#' \item \code{num_parallel_tree} Experimental parameter. number of trees to grow per round. Useful to test Random Forest through Xgboost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{round = 1}) accordingly. Default: 1
#' \item \code{monotone_constraints} A numerical vector consisting of \code{1}, \code{0} and \code{-1}, with its length equal to the number of features in the training data. \code{1} is increasing, \code{-1} is decreasing and \code{0} is no constraint.
#' }
#'
#' 2.2. Parameter for Linear Booster
#'
#' \itemize{
#' \item \code{lambda} L2 regularization term on weights. Default: 0
#' \item \code{lambda_bias} L2 regularization term on bias. Default: 0
#' \item \code{alpha} L1 regularization term on weights. (there is no L1 reg on bias because it is not important). Default: 0
#' }
#'
#' 3. Task Parameters
#'
#' \itemize{
#' \item \code{objective} specify the learning task and the corresponding learning objective, users can pass a self-defined function to it. The default objective options are below:
#' \itemize{
#' \item \code{reg:linear} linear regression (Default).
#' \item \code{reg:logistic} logistic regression.
#' \item \code{binary:logistic} logistic regression for binary classification. Output probability.
#' \item \code{binary:logitraw} logistic regression for binary classification, output score before logistic transformation.
#' \item \code{num_class} set the number of classes. To use only with multiclass objectives.
#' \item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective. Class is represented by a number and should be from 0 to \code{num_class - 1}.
#' \item \code{multi:softprob} same as softmax, but prediction outputs a vector of \code{ndata * nclass} elements, which can be further reshaped into an \code{ndata x nclass} matrix. The result contains the predicted probability of each data point belonging to each class.
#' \item \code{rank:pairwise} set xgboost to do ranking task by minimizing the pairwise loss.
#' }
#' \item \code{base_score} the initial prediction score of all instances, global bias. Default: 0.5
#' \item \code{eval_metric} evaluation metrics for validation data. Users can pass a self-defined function to it. Default: the metric will be assigned according to the objective (rmse for regression, error for classification, mean average precision for ranking). The full list is provided in the detail section.
#' }
#'
#' @param data training dataset. \code{xgb.train} accepts only an \code{xgb.DMatrix} as the input.
#' \code{xgboost}, in addition, also accepts \code{matrix}, \code{dgCMatrix}, or name of a local data file.
#' @param nrounds max number of boosting iterations.
#' @param watchlist named list of xgb.DMatrix datasets to use for evaluating model performance.
#' Metrics specified in either \code{eval_metric} or \code{feval} will be computed for each
#' of these datasets during each boosting iteration, and stored in the end as a field named
#' \code{evaluation_log} in the resulting object. When either \code{verbose>=1} or
#' \code{\link{cb.print.evaluation}} callback is engaged, the performance results are continuously
#' printed out during the training.
#' E.g., specifying \code{watchlist=list(validation1=mat1, validation2=mat2)} allows tracking
#' the performance of each round's model on mat1 and mat2.
#' @param obj customized objective function. Returns gradient and second order
#' gradient with given prediction and dtrain.
#'
#' The available \code{eval_metric} options include (among others):
#' \item \code{error} Binary classification error rate. It is calculated as \code{(# wrong cases) / (# all cases)}.
#' By default, it uses the 0.5 threshold for predicted values to define negative and positive instances.
#' A different threshold (e.g., 0.) can be specified as "error@0."
#' \item \code{merror} Multiclass classification error rate. It is calculated as \code{(# wrong cases) / (# all cases)}.
#' \item \code{auc} Area under the curve. See \url{http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_curve} for ranking evaluation.
#' \item \code{aucpr} Area under the PR curve. See \url{https://en.wikipedia.org/wiki/Precision_and_recall} for ranking evaluation.
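For orientation, here is a minimal, hedged sketch of how these parameters come together in a call to `xgb.train` (the dataset, watchlist names and parameter values below are purely illustrative; `agaricus.train`/`agaricus.test` ship with the xgboost package):
```r
library(xgboost)
data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest  <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
params <- list(objective = "binary:logistic",  # task parameter
               eta = 0.3, max_depth = 6,       # tree booster parameters
               nthread = 2)                    # general parameter
bst <- xgb.train(params = params, data = dtrain, nrounds = 10,
                 watchlist = list(train = dtrain, validation = dtest))
```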
For the up-to-date version (which is recommended), please install from GitHub. Windows users will need to install [RTools](http://cran.r-project.org/bin/windows/Rtools/) first.
Resources
---------
* [XGBoost R Package Online Documentation](http://xgboost.readthedocs.org/en/latest/R-package/index.html)
- Check this out for detailed documents, examples and tutorials.
We are [on CRAN](https://cran.r-project.org/web/packages/xgboost/index.html) now. For the stable/pre-compiled version (for Windows and OS X), please install from CRAN:
```r
install.packages('xgboost')
```
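One possible way to do the GitHub installation mentioned above is via the devtools package; this is a hedged sketch that assumes the R package lives in the `R-package/` subdirectory of the `dmlc/xgboost` repository:
```r
# Install the development version from GitHub (repository path and subdir are assumptions).
install.packages('devtools')
devtools::install_github('dmlc/xgboost', subdir = 'R-package')
```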
For more detailed installation instructions, please see [here](http://xgboost.readthedocs.org/en/latest/build.html#r-package-installation).
Examples
--------
* Please visit the [walk-through examples](demo).
* See also the [example scripts](../demo/kaggle-higgs) for the Kaggle Higgs Challenge, including a [speedtest script](../demo/kaggle-higgs/speedtest.R) on this dataset, and the one related to the [Otto challenge](../demo/kaggle-otto), including an [RMarkdown documentation](../demo/kaggle-otto/understandingXGBoostModel.Rmd).
Development
-----------
* See the [R Package section](https://xgboost.readthedocs.io/en/latest/how_to/contribute.html#r-package) of the contributors guide.
if (!require(vcd)) {
  install.packages('vcd') # Available on CRAN. Used for its dataset with categorical values.
  require(vcd)
}
# According to its documentation, XGBoost works only on numbers.
# Sometimes the dataset we have to work on has categorical data.
# A categorical variable is one which has a fixed number of values. For example, if for each observation a variable called "Colour" can have only "red", "blue" or "green" as its value, it is a categorical variable.
#
# In R, a categorical variable is called a factor.
# Type ?factor in the console for more information.
#
# In this demo we will see how to transform a dense dataframe with categorical variables into a sparse matrix before analyzing it with XGBoost.
# The method we are going to use is usually called "one-hot encoding".
# Load the Arthritis dataset into memory.
data(Arthritis)
# Create a copy of the dataset with the data.table package (data.table is 100% compatible with R data.frame, but its syntax is much more consistent and its performance is really good).
df<-data.table(Arthritis,keep.rownames=F)
# Let's have a look at the data.table
cat("Print the dataset\n")
print(df)
# 2 columns have factor type, one has ordinal type (an ordinal variable is a categorical variable with values which can be ordered, here: None > Some > Marked).
cat("Structure of the dataset\n")
str(df)
# Let's add some new categorical features to see if they help. Of course these features are highly correlated with the Age feature. Usually that's not a good thing in ML, but tree algorithms (including boosted trees) are able to select the best features, even in the case of highly correlated features.
# For the first feature we create groups of age by rounding the real age. Note that we transform it to a factor (categorical data) so the algorithm treats them as independent values.
df[,AgeDiscret:=as.factor(round(Age/10,0))]
# Here is an even stronger simplification of the real age, with an arbitrary split at 30 years old. I chose this value based on nothing. We will see later whether simplifying the information based on arbitrary values is a good strategy (I am sure you already have an idea of how well it will work!).
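# Sketch of the AgeCat feature described just above (the cut-off at 30 follows the comment; the factor labels are illustrative):
df[,AgeCat:=as.factor(ifelse(Age > 30, "Old", "Young"))]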
# We remove ID as there is nothing to learn from this feature (it will just add some noise as the dataset is small).
df[,ID:=NULL]
# List the different values for the column Treatment: Placebo, Treated.
cat("Values of the categorical feature Treatment\n")
print(levels(df[,Treatment]))
# Next step, we will transform the categorical data to dummy variables.
# This method is also called one hot encoding.
# The purpose is to transform each value of each categorical feature in one binary feature.
#
# For example, the column Treatment will be replaced by two columns, Placebo and Treated. Each of them will be binary. An observation which had the value Placebo in column Treatment before the transformation will have, after the transformation, the value 1 in the new column Placebo and the value 0 in the new column Treated.
#
# The formula Improved~.-1 used below means: transform all categorical features except column Improved to binary values.
# Column Improved is excluded because it will be our output column, the one we want to predict.
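# A minimal sketch of the steps described above (hyperparameter values and object names are
# illustrative assumptions, not a definitive recipe):
require(Matrix)
require(xgboost)
sparse_matrix <- sparse.model.matrix(Improved~.-1, data = df)
# Create the binary outcome Y (1 when the treatment led to a "Marked" improvement);
# it is also the column used by the chi-squared tests further down.
df[,Y:=ifelse(Improved == "Marked", 1, 0)]
output_vector <- df[,Y]
# Train a small boosted-trees model and compute the feature importance matrix.
bst <- xgboost(data = sparse_matrix, label = output_vector, max_depth = 4, eta = 1,
               nthread = 2, nrounds = 10, objective = "binary:logistic")
importance <- xgb.importance(feature_names = colnames(sparse_matrix), model = bst)
print(importance)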
# According to the importance matrix, the most important feature in this dataset for predicting whether the treatment will work is Age. The second most important feature is having received a placebo or not. Sex is third. Then come our generated features (AgeDiscret). We can see that their contribution is very low (Gain column).
# Do these results make sense?
# Let's check some Chi2 between each of these features and the outcome.
print(chisq.test(df$Age,df$Y))
# The Pearson chi-squared statistic between Age and the illness disappearing is 35.
print(chisq.test(df$AgeDiscret,df$Y))
# Our first simplification of Age gives a chi-squared statistic of 8.
print(chisq.test(df$AgeCat,df$Y))
# The perfectly arbitrary split I made between young and old at 30 years gives a low chi-squared statistic of 2. This is a result we might expect: maybe in my mind being over 30 means being old (I am 32 and starting to feel old, which may explain it), but for the illness we are studying, the age at which one becomes vulnerable is not the same. Don't let your "gut" lower the quality of your model. In "data science", there is science :-)
# As you can see, in general, destroying information by simplifying it won't improve your model. Chi-squared just demonstrates that. But in more complex cases, creating a new feature from an existing one which makes the link with the outcome more obvious may help the algorithm and improve the model. The case studied here is not complex enough to show that. Check the Kaggle forums for some challenging datasets.
# However it's almost always worse when you add some arbitrary rules.
# Moreover, you can notice that even though we added some new features that are not useful and are highly correlated with other features, the boosted-tree algorithm was still able to choose the best one, which in this case is Age. A linear model may not be that strong in this scenario.
\code{list(metric='metric-name', value='metric-value')} with given
prediction and dtrain.}
\item{stratified}{a \code{boolean} indicating whether sampling of folds should be stratified
by the values of outcome labels.}
\item{folds}{\code{list} provides a possibility to use a list of pre-defined CV folds
(each element must be a vector of test fold's indices). When folds are supplied,
the \code{nfold} and \code{stratified} parameters are ignored.}
\item{verbose}{\code{boolean}, print the statistics during the process}
\item{print_every_n}{Print each n-th iteration evaluation messages when \code{verbose>0}.
Default is 1 which means all messages are printed. This parameter is passed to the
\code{\link{cb.print.evaluation}} callback.}
\item{early_stopping_rounds}{If \code{NULL}, the early stopping function is not triggered.
If set to an integer \code{k}, training with a validation set will stop if the performance
doesn't improve for \code{k} rounds.
Setting this parameter engages the \code{\link{cb.early.stop}} callback.}
\item{maximize}{If \code{feval} and \code{early_stopping_rounds} are set,
then this parameter must be set as well.
When it is \code{TRUE}, it means the larger the evaluation score the better.
This parameter is passed to the \code{\link{cb.early.stop}} callback.}
\item{callbacks}{a list of callback functions to perform various tasks during boosting.
See \code{\link{callbacks}}. Some of the callbacks are automatically created depending on the
parameters' values. Users can provide either existing or their own callback methods in order
to customize the training process.}
\item{...}{other parameters to pass to \code{params}.}
}
\value{
An object of class \code{xgb.cv.synchronous} with the following elements:
\itemize{
\item \code{call} a function call.
\item \code{params} parameters that were passed to the xgboost library. Note that it does not
capture parameters changed by the \code{\link{cb.reset.parameters}} callback.
\item \code{callbacks} callback functions that were either automatically assigned or
explicitly passed.
\item \code{evaluation_log} evaluation history stored as a \code{data.table} with the
first column corresponding to iteration number and the rest corresponding to the
CV-based evaluation means and standard deviations for the training and test CV-sets.
It is created by the \code{\link{cb.evaluation.log}} callback.
\item \code{niter} number of boosting iterations.
\item \code{nfeatures} number of features in training data.
\item \code{folds} the list of CV folds' indices - either those passed through the \code{folds}
parameter or randomly generated.
\item \code{best_iteration} iteration number with the best evaluation metric value
(only available with early stopping).
\item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the best iteration,
which could further be used in \code{predict} method
(only available with early stopping).
\item \code{pred} CV prediction values available when \code{prediction} is set.
It is either vector or matrix (see \code{\link{cb.cv.predict}}).
\item \code{models} a list of the CV folds' models. It is only available with the explicit
setting of the \code{cb.cv.predict(save_models = TRUE)} callback.
}
}
\description{
The cross validation function of xgboost
}
\details{
The original sample is randomly partitioned into \code{nfold} equal size subsamples.
Of the \code{nfold} subsamples, a single subsample is retained as the validation data for testing the model, and the remaining \code{nfold - 1} subsamples are used as training data.
The cross-validation process is then repeated \code{nrounds} times, with each of the \code{nfold} subsamples used exactly once as the validation data.
All observations are used for both training and validation.
This function only accepts an \code{xgb.DMatrix} object as the input.
Parallelization is automatically enabled if OpenMP is present.
The number of threads can also be manually specified via the \code{nthread} parameter.
Adapted from \url{http://en.wikipedia.org/wiki/Cross-validation_\%28statistics\%29#k-fold_cross-validation}
}
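As orientation for the arguments and return values documented above, here is a minimal, hedged sketch of an `xgb.cv` call (the dataset and parameter values are purely illustrative; `agaricus.train` ships with the xgboost package):
```r
library(xgboost)
data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
cv <- xgb.cv(params = list(objective = "binary:logistic", eta = 0.3, max_depth = 3),
             data = dtrain, nrounds = 20, nfold = 5,
             early_stopping_rounds = 3, maximize = FALSE, verbose = TRUE)
# Fields described in the value section above:
print(cv$evaluation_log)
print(cv$best_iteration)
```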