The functions featureValueOfSparseVector and featureValueOfDenseVector could return Float.NaN if the input vector contained any missing values. This caused the partition key computation to fail, so most vectors ended up in the same partition. We fix this by avoiding the NaN and simply using the row's hashCode in that case. We added a test ensuring that the repartition is now uniform on an input dataset containing missing values, by checking that the variance of the partition sizes is below a certain threshold. Signed-off-by: Anthony D'Amato <anthony.damato@hotmail.fr>
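The fallback described above can be sketched as follows. This is a minimal illustration, not the actual XGBoost4J-Spark code: the method name `partitionKey`, the `double[]` row type, and the partition count are assumptions.

```java
import java.util.Arrays;

public class PartitionKeySketch {
    // Hypothetical sketch: derive a partition key from a feature value,
    // falling back to the row's hash code when the value is NaN, so that
    // rows with missing values no longer pile up in a single partition.
    static int partitionKey(double featureValue, double[] row, int numPartitions) {
        if (Double.isNaN(featureValue)) {
            // NaN would poison the key computation; use the row's hash instead.
            return Math.floorMod(Arrays.hashCode(row), numPartitions);
        }
        return Math.floorMod(Double.hashCode(featureValue), numPartitions);
    }

    public static void main(String[] args) {
        double[] rowWithMissing = {1.0, Double.NaN, 3.0};
        int key = partitionKey(Double.NaN, rowWithMissing, 8);
        System.out.println(key >= 0 && key < 8); // key stays in range despite the NaN
    }
}
```

Because `Arrays.hashCode` is deterministic per row content, identical rows still map to the same partition, while distinct rows with missing values spread across all partitions.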
XGBoost4J: Distributed XGBoost for Scala/Java
Documentation | Resources | Release Notes
XGBoost4J is the JVM package of xgboost. It brings all the optimizations and power of xgboost into the JVM ecosystem.
- Train XGBoost models in Scala and Java with easy customizations.
- Run distributed xgboost natively on JVM frameworks such as Apache Flink and Apache Spark.
You can find out more about XGBoost on the Documentation and Resource Page.
Add Maven Dependency
XGBoost4J, XGBoost4J-Spark, etc. in the Maven repository are compiled with g++-4.8.5.
Access release version
maven
<dependency>
  <groupId>ml.dmlc</groupId>
  <artifactId>xgboost4j_2.12</artifactId>
  <version>latest_version_num</version>
</dependency>
sbt
"ml.dmlc" %% "xgboost4j" % "latest_version_num"
For the latest release version number, please check here.
If you want to use xgboost4j-spark, simply replace xgboost4j with xgboost4j-spark.
Access SNAPSHOT version
You need to add GitHub as a repo:
maven:
<repository>
  <id>GitHub Repo</id>
  <name>GitHub Repo</name>
  <url>https://raw.githubusercontent.com/CodingCat/xgboost/maven-repo/</url>
</repository>
sbt:
resolvers += "GitHub Repo" at "https://raw.githubusercontent.com/CodingCat/xgboost/maven-repo/"
then add the dependency as follows:
maven
<dependency>
  <groupId>ml.dmlc</groupId>
  <artifactId>xgboost4j_2.12</artifactId>
  <version>latest_version_num</version>
</dependency>
sbt
"ml.dmlc" %% "xgboost4j" % "latest_version_num"
For the latest release version number, please check here.
If you want to use xgboost4j-spark, simply replace xgboost4j with xgboost4j-spark.
Examples
Full code examples for Scala, Java, Apache Spark, and Apache Flink can be found in the examples package.
NOTE on LIBSVM Format:
There is an inconsistency between XGBoost4J-Spark and other language bindings of XGBoost.
When users use Spark to load a training/test set in LIBSVM format with the following code snippet:
spark.read.format("libsvm").load("trainingset_libsvm")
Spark assumes that the dataset is 1-based indexed. However, when you do prediction with other bindings of XGBoost (e.g., the Python API of XGBoost), XGBoost assumes that the dataset is 0-based indexed. This creates a pitfall for users who train a model with Spark but predict on a dataset in the same format with other bindings of XGBoost.
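One way to sidestep the pitfall is to shift the indices yourself before crossing bindings. The converter below is an illustrative sketch (not part of XGBoost4J) that rewrites a 1-based LIBSVM line to 0-based by decrementing every feature index:

```java
public class LibsvmIndexShift {
    // Convert one LIBSVM line from 1-based to 0-based feature indices.
    // Expected format: "<label> <index>:<value> <index>:<value> ..."
    static String toZeroBased(String line) {
        String[] tokens = line.trim().split("\\s+");
        StringBuilder out = new StringBuilder(tokens[0]); // label is unchanged
        for (int i = 1; i < tokens.length; i++) {
            int colon = tokens[i].indexOf(':');
            int index = Integer.parseInt(tokens[i].substring(0, colon));
            // Decrement the index; keep the ":<value>" part verbatim.
            out.append(' ').append(index - 1).append(tokens[i].substring(colon));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "1:0.5" (1-based, i.e. the first feature) becomes "0:0.5" (0-based).
        System.out.println(toZeroBased("1 1:0.5 3:0.2")); // prints "1 0:0.5 2:0.2"
    }
}
```

Applying this to every line of the file used with Spark yields a dataset that the Python API interprets with the same feature positions.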
Development
You can build/package xgboost4j locally with the following steps:
Linux:
- Ensure Docker for Linux is installed.
- Clone this repo:
  git clone --recursive https://github.com/dmlc/xgboost.git
- Run the following command:
  - With Tests:
    ./xgboost/jvm-packages/dev/build-linux.sh
  - Skip Tests:
    ./xgboost/jvm-packages/dev/build-linux.sh --skip-tests
Windows:
- Ensure Docker for Windows is installed.
- Clone this repo:
  git clone --recursive https://github.com/dmlc/xgboost.git
- Run the following command:
  - With Tests:
    .\xgboost\jvm-packages\dev\build-linux.cmd
  - Skip Tests:
    .\xgboost\jvm-packages\dev\build-linux.cmd --skip-tests
Note: this will create jars for deployment on Linux machines.