Sergei Lebedev 771a95aec6 [jvm-packages] Added baseMargin to ml.dmlc.xgboost4j.LabeledPoint (#2532)

* Converted ml.dmlc.xgboost4j.LabeledPoint to Scala

This makes it easy to integrate LabeledPoint with Spark DataFrame APIs,
which support encoding/decoding case classes out of the box. An alternative
solution would be to keep LabeledPoint in Java and make it a Bean by
generating boilerplate getters/setters. I decided against that, even
though the conversion in this PR implies a public API change.

I also had to remove the factory methods fromSparseVector and
fromDenseVector because a) they would need to be duplicated to support
overloaded calls with extra data (e.g. weight); and b) Scala would expose
them via mangled $.MODULE$ which looks ugly in Java.
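
As a rough illustration of point (a), a case class with default arguments can cover the "extra data" variants without overloaded factories. This is a minimal sketch only: the field names, their order and the default values are assumptions made for the example, not a copy of the class in this PR.

```scala
object LabeledPointSketch {
  // Illustrative only: field names and defaults are assumptions of this sketch.
  final case class LabeledPoint(
      label: Float,
      indices: Array[Int],      // in this sketch, null marks a dense point
      values: Array[Float],
      weight: Float = 1f,
      group: Int = -1,
      baseMargin: Float = Float.NaN)

  // Dense point: only the required fields are given.
  val dense = LabeledPoint(1f, null, Array(0.5f, 0.25f))

  // Sparse point with a weight: a named argument replaces an overloaded factory.
  val sparseWeighted = LabeledPoint(0f, Array(0, 3), Array(1f, 2f), weight = 2f)

  // copy() derives a point with a margin without touching the other fields.
  val withMargin = dense.copy(baseMargin = 0.1f)
}
```

Named arguments and `copy` keep call sites short when a new optional field is added, which is the duplication problem the factory methods would have had.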

Additionally, this commit makes it possible to switch to LabeledPoint in
all public APIs and effectively to pass initial margin/group as part of
the point. This seems to be the only reliable way of implementing distributed
learning with these data. Note that group size format used by single-node
XGBoost is not compatible with that scenario, since the partition split
could divide a group into two chunks.
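
The group-size problem can be seen with plain Scala collections standing in for RDD partitions. The names below (`Point`, `groupSizes`, `part1`, `part2`) are hypothetical and exist only for this sketch; nothing here is taken from the XGBoost code.

```scala
object GroupSplitSketch {
  // Hypothetical stand-in for a labeled point that carries its own group id.
  final case class Point(label: Float, group: Int)

  // Recover group sizes from consecutive runs of equal group ids.
  def groupSizes(points: Seq[Point]): Seq[Int] =
    points.foldLeft(List.empty[(Int, Int)]) {
      case ((g, n) :: rest, p) if g == p.group => (g, n + 1) :: rest
      case (acc, p) => (p.group, 1) :: acc
    }.reverse.map(_._2)

  // Five points in two groups: the single-node size format would be Seq(3, 2).
  val points: Seq[Point] = Seq(
    Point(1f, 0), Point(0f, 0), Point(0f, 0),
    Point(1f, 1), Point(0f, 1))

  // An arbitrary partition boundary that falls inside group 0.
  val (part1, part2) = points.splitAt(2)
}
```

With this split, `groupSizes(points)` is `Seq(3, 2)`, while the two partitions yield `Seq(2)` and `Seq(1, 2)`. A global size array therefore describes neither partition, whereas a group id stored on each point lets every partition work out its own local sizes.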

* Switched to ml.dmlc.xgboost4j.LabeledPoint in RDD-based public APIs

Note that DataFrame-based and Flink APIs are not affected by this change.

* Removed baseMargin argument in favour of the LabeledPoint field

* Do a single pass over the partition in buildDistributedBoosters

Note that there is no formal guarantee that

    val repartitioned = rdd.repartition(42)
    repartitioned.zipPartitions(repartitioned.map(_ + 1)) { (it1, it2) => ... }

would do a single shuffle, but in practice this always seems to be the case.

* Exposed baseMargin in DataFrame-based API

* Addressed review comments

* Pass baseMargin to XGBoost.trainWithDataFrame via params

* Reverted MLLabeledPoint in Spark APIs

As discussed, baseMargin would only be supported for DataFrame-based APIs.

* Cleaned up baseMargin tests

- Removed RDD-based test, since the option is no longer exposed via
  public APIs
- Changed DataFrame-based one to check that adding a margin actually
  affects the prediction

* Pleased Scalastyle

* Addressed more review comments

* Pleased scalastyle again

* Fixed XGBoost.fromBaseMarginsToArray

which always returned an array of NaNs even if base margin was not
specified. Surprisingly this only failed a few tests.
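
One way to express the intended behaviour is sketched below. It is not the actual body of XGBoost.fromBaseMarginsToArray: the helper name `toMarginArray`, the use of NaN as the "not specified" marker, the `Option` return type and the error on partially specified margins are all assumptions of this example.

```scala
object BaseMarginSketch {
  // Assumption: NaN means "no base margin was given for this point".
  def toMarginArray(margins: Iterator[Float]): Option[Array[Float]] = {
    val all = margins.toArray
    val undefined = all.count(_.isNaN)
    if (undefined == all.length) None        // nothing specified: no array at all
    else if (undefined == 0) Some(all)       // fully specified: pass it through
    else throw new IllegalArgumentException(
      s"base margin missing for $undefined of ${all.length} points")
  }
}
```

Returning `None` rather than an array of NaNs lets the caller skip setting a base margin entirely when none was provided.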

2017-08-10 14:29:26 -07:00

| Name | Last commit | Date |
| --- | --- | --- |
| amalgamation | [WIP] Extract prediction into separate interface (#2531) | 2017-07-28 17:01:03 -07:00 |
| cmake | [GPU-Plugin] Add load balancing search to gpu_hist. Add compressed iterator. (#2504) | 2017-07-11 22:36:39 +12:00 |
| cub @ f3937a96fd | [GPU-Plugin] Multi-GPU gpu_id bug fixes for grow_gpu_hist and grow_gpu methods, and additional documentation for the gpu plugin. (#2463) | 2017-06-30 20:04:17 +12:00 |
| demo | Update bosch.py (#2482) | 2017-07-06 20:05:09 -04:00 |
| dmlc-core @ b5bec5481d | Remove xgboost's thread_local and switch to dmlc::ThreadLocalStore (#2121) | 2017-03-27 09:09:18 -07:00 |
| doc | Fix parameter documentation inconsistencies (#2584) | 2017-08-07 19:07:10 +02:00 |
| include/xgboost | [GPU-Plugin] Various fixes (#2579) | 2017-08-05 22:16:23 +12:00 |
| jvm-packages | [jvm-packages] Added baseMargin to ml.dmlc.xgboost4j.LabeledPoint (#2532) | 2017-08-10 14:29:26 -07:00 |
| make | Not use -msse2 on power or arm arch. close #2446 (#2475) | 2017-07-06 20:06:55 -04:00 |
| nccl @ 018ff75f78 | Update nccl (#2494) | 2017-07-07 12:36:26 +12:00 |
| plugin | [GPU-Plugin] Add throw of asserts and added compute compatibility error check. (#2565) | 2017-08-10 16:07:07 +12:00 |
| python-package | Fix typo in sklearn documentation (#2580) | 2017-08-07 19:06:11 +02:00 |
| R-package | [R] many minor changes to increase the robustness of the R code (#2404) | 2017-06-15 22:56:23 -05:00 |
| rabit @ a764d45cfb | [UPDATE] Update rabit and threadlocal (#2114) | 2017-03-16 18:48:37 -07:00 |
| src | Several fixes (#2572) | 2017-08-06 13:03:50 -05:00 |
| tests | Multi-threaded XGDMatrixCreateFromMat for faster DMatrix creation (#2530) | 2017-07-21 14:43:17 +12:00 |
| .gitignore | [GPU-Plugin] Multi-GPU gpu_id bug fixes for grow_gpu_hist and grow_gpu methods, and additional documentation for the gpu plugin. (#2463) | 2017-06-30 20:04:17 +12:00 |
| .gitmodules | Several fixes (#2572) | 2017-08-06 13:03:50 -05:00 |
| .travis.yml | [jvm-packages] Another pack of build/CI improvements (#2422) | 2017-06-21 12:28:35 -07:00 |
| appveyor.yml | MinGW: shared library prefix and appveyor CI (#2539) | 2017-07-25 01:06:47 -05:00 |
| build.sh | Add build failure message (#2397) | 2017-06-25 22:32:11 -04:00 |
| CMakeLists.txt | [WIP] Extract prediction into separate interface (#2531) | 2017-07-28 17:01:03 -07:00 |
| CONTRIBUTORS.md | Update CONTRIBUTORS.md (#2350) | 2017-05-27 08:38:32 -07:00 |
| ISSUE_TEMPLATE.md | Update ISSUE_TEMPLATE.md (#2308) | 2017-05-18 08:49:07 -07:00 |
| Jenkinsfile | [BUILD] Dockerfile and Jenkinsfile revisited (#2514) | 2017-07-13 17:51:47 +12:00 |
| LICENSE | update year in LICENSE, conf.py and README.md files | 2016-03-15 16:51:34 +03:00 |
| Makefile | Several fixes (#2572) | 2017-08-06 13:03:50 -05:00 |
| NEWS.md | Sklearn kwargs (#2338) | 2017-05-23 21:47:53 -05:00 |
| README.md | [GPU-Plugin] (#2227) | 2017-04-25 16:37:10 -07:00 |

README.md

eXtreme Gradient Boosting

Documentation | Resources | Installation | Release Notes | RoadMap

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.

What's New

Ask a Question

For reporting bugs please use the xgboost/issues page.
For generic questions or to share your experience using XGBoost please use the XGBoost User Group

Help to Make XGBoost Better

XGBoost has been developed and used by a group of active community members. Your help is very valuable to make the package better for everyone.

Check out call for contributions and Roadmap to see what can be improved, or open an issue if you want something.
Contribute to the documents and examples to share your experience with other users.
Add your stories and experience to Awesome XGBoost.
Please add your name to CONTRIBUTORS.md after your patch has been merged.
- Please also update NEWS.md with changes and improvements to the API and docs.

License

Reference

Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. In 22nd SIGKDD Conference on Knowledge Discovery and Data Mining, 2016.
XGBoost originates from a research project at the University of Washington; see also the Project Page at UW.

Languages

C++ 45.5%

Python 20.3%

Cuda 15.2%

R 6.8%

Scala 6.4%

Other 5.6%