Doc modernization (#3474)

* Change doc build to reST exclusively

* Rewrite Intro doc in reST; create toctree

* Update parameter and contribute

* Convert tutorials to reST

* Convert Python tutorials to reST

* Convert CLI and Julia docs to reST

* Enable markdown for R vignettes

* Done migrating to reST

* Add guzzle_sphinx_theme to requirements

* Add breathe to requirements

* Fix search bar

* Add link to user forum
This commit is contained in:
Philip Hyunsu Cho
2018-07-19 14:22:16 -07:00
committed by GitHub
parent c004cea788
commit 05b089405d
57 changed files with 2833 additions and 3957 deletions


@@ -1,134 +0,0 @@
XGBoost JVM Package
===================
[![Build Status](https://travis-ci.org/dmlc/xgboost.svg?branch=master)](https://travis-ci.org/dmlc/xgboost)
[![GitHub license](http://dmlc.github.io/img/apache2.svg)](../LICENSE)
You have found the XGBoost JVM Package!
Installation
------------
#### Installation from source
Building XGBoost4J using Maven requires Maven 3 or newer, Java 7+ and CMake 3.2+ for compiling the JNI bindings.
Before you install XGBoost4J, you need to define environment variable `JAVA_HOME` as your JDK directory to ensure that your compiler can find `jni.h` correctly, since XGBoost4J relies on JNI to implement the interaction between the JVM and native libraries.
Once `JAVA_HOME` is defined correctly, installing XGBoost4J is as simple as running `mvn package` under the `jvm-packages` directory. You can also skip the tests by running `mvn -DskipTests=true package` if you are sure about the correctness of your local setup.
To publish the artifacts to your local maven repository, run
mvn install
Or, if you would like to skip tests, run
mvn -DskipTests install
This command will publish the xgboost binaries, the compiled java classes as well as the java sources to your local repository. Then you can use XGBoost4J in your Java projects by including the following dependency in `pom.xml`:
<b>maven</b>
```
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j</artifactId>
<version>latest_source_version_num</version>
</dependency>
```
For sbt, add the repository and dependency to `build.sbt` as follows:
<b>sbt</b>
```sbt
resolvers += "Local Maven Repository" at "file://"+Path.userHome.absolutePath+"/.m2/repository"
"ml.dmlc" % "xgboost4j" % "latest_source_version_num"
```
#### Installation from maven repo
### Access release version
<b>maven</b>
```
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j</artifactId>
<version>latest_version_num</version>
</dependency>
```
<b>sbt</b>
```sbt
"ml.dmlc" % "xgboost4j" % "latest_version_num"
```
For the latest release version number, please check [here](https://github.com/dmlc/xgboost/releases).
If you want to use `xgboost4j-spark`, replace `xgboost4j` with `xgboost4j-spark`.
### Access SNAPSHOT version
You need to add github as repo:
<b>maven</b>:
```xml
<repository>
<id>GitHub Repo</id>
<name>GitHub Repo</name>
<url>https://raw.githubusercontent.com/CodingCat/xgboost/maven-repo/</url>
</repository>
```
<b>sbt</b>:
```sbt
resolvers += "GitHub Repo" at "https://raw.githubusercontent.com/CodingCat/xgboost/maven-repo/"
```
Then add the dependency as follows:
<b>maven</b>
```
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j</artifactId>
<version>latest_version_num</version>
</dependency>
```
<b>sbt</b>
```sbt
"ml.dmlc" % "xgboost4j" % "latest_version_num"
```
For the latest release version number, please check [here](https://github.com/CodingCat/xgboost/tree/maven-repo/ml/dmlc/xgboost4j).
If you want to use `xgboost4j-spark`, replace `xgboost4j` with `xgboost4j-spark`.
Since its integration with the DataFrame/Dataset APIs of Spark 2.0, XGBoost4J-Spark can only be compiled against Spark 2.x. You can build XGBoost4J-Spark as a component of XGBoost4J by running `mvn package`, and you can specify the Spark version with `mvn -Dspark.version=2.0.0 package`. (To continue working with Spark 1.x, update `pom.xml` by modifying properties such as `spark.version`, `scala.version`, and `scala.binary.version`. You also need to change the implementation by replacing `SparkSession` with `SQLContext` and the type of the API parameters from `Dataset[_]` to `DataFrame`.)
#### Enabling OpenMP for Mac OS
If you are on Mac OS and using a compiler that supports OpenMP, you need to go to the file `xgboost/jvm-packages/create_jni.py` and comment out the line
```python
CONFIG["USE_OPENMP"] = "OFF"
```
in order to get the benefit of multi-threading.
Contents
--------
* [Java Overview Tutorial](java_intro.md)
Resources
---------
* [Code Examples](https://github.com/dmlc/xgboost/tree/master/jvm-packages/xgboost4j-example)
* [Java API Docs](http://dmlc.ml/docs/javadocs/index.html)
## Scala API Docs
* [XGBoost4J](http://dmlc.ml/docs/scaladocs/xgboost4j/index.html)
* [XGBoost4J-Spark](http://dmlc.ml/docs/scaladocs/xgboost4j-spark/index.html)
* [XGBoost4J-Flink](http://dmlc.ml/docs/scaladocs/xgboost4j-flink/index.html)

145
doc/jvm/index.rst Normal file

@@ -0,0 +1,145 @@
###################
XGBoost JVM Package
###################
.. raw:: html
<a href="https://travis-ci.org/dmlc/xgboost">
<img alt="Build Status" src="https://travis-ci.org/dmlc/xgboost.svg?branch=master">
</a>
<a href="https://github.com/dmlc/xgboost/blob/master/LICENSE">
<img alt="GitHub license" src="http://dmlc.github.io/img/apache2.svg">
</a>
You have found the XGBoost JVM Package!
************
Installation
************
Installation from source
========================
Building XGBoost4J using Maven requires Maven 3 or newer, Java 7+ and CMake 3.2+ for compiling the JNI bindings.
Before you install XGBoost4J, you need to define environment variable ``JAVA_HOME`` as your JDK directory to ensure that your compiler can find ``jni.h`` correctly, since XGBoost4J relies on JNI to implement the interaction between the JVM and native libraries.
Once ``JAVA_HOME`` is defined correctly, installing XGBoost4J is as simple as running ``mvn package`` under the ``jvm-packages`` directory. You can also skip the tests by running ``mvn -DskipTests=true package`` if you are sure about the correctness of your local setup.
To publish the artifacts to your local maven repository, run
.. code-block:: bash
mvn install
Or, if you would like to skip tests, run
.. code-block:: bash
mvn -DskipTests install
This command will publish the xgboost binaries, the compiled java classes as well as the java sources to your local repository. Then you can use XGBoost4J in your Java projects by including the following dependency in ``pom.xml``:
.. code-block:: xml
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j</artifactId>
<version>latest_source_version_num</version>
</dependency>
For sbt, add the repository and dependency to ``build.sbt`` as follows:
.. code-block:: scala
resolvers += "Local Maven Repository" at "file://"+Path.userHome.absolutePath+"/.m2/repository"
"ml.dmlc" % "xgboost4j" % "latest_source_version_num"
Installation from maven repo
============================
Access release version
----------------------
.. code-block:: xml
:caption: maven
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j</artifactId>
<version>latest_version_num</version>
</dependency>
.. code-block:: scala
:caption: sbt
"ml.dmlc" % "xgboost4j" % "latest_version_num"
For the latest release version number, please check `here <https://github.com/dmlc/xgboost/releases>`_.
If you want to use XGBoost4J-Spark, replace ``xgboost4j`` with ``xgboost4j-spark``.
Access SNAPSHOT version
-----------------------
You need to add GitHub as repo:
.. code-block:: xml
:caption: maven
<repository>
<id>GitHub Repo</id>
<name>GitHub Repo</name>
<url>https://raw.githubusercontent.com/CodingCat/xgboost/maven-repo/</url>
</repository>
.. code-block:: scala
:caption: sbt
resolvers += "GitHub Repo" at "https://raw.githubusercontent.com/CodingCat/xgboost/maven-repo/"
Then add the dependency as follows:
.. code-block:: xml
:caption: maven
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j</artifactId>
<version>latest_version_num</version>
</dependency>
.. code-block:: scala
:caption: sbt
"ml.dmlc" % "xgboost4j" % "latest_version_num"
For the latest release version number, please check `here <https://github.com/CodingCat/xgboost/tree/maven-repo/ml/dmlc/xgboost4j>`_.
If you want to use XGBoost4J-Spark, replace ``xgboost4j`` with ``xgboost4j-spark``.
Since its integration with the DataFrame/Dataset APIs of Spark 2.0, XGBoost4J-Spark can only be compiled against Spark 2.x. You can build XGBoost4J-Spark as a component of XGBoost4J by running ``mvn package``, and you can specify the Spark version with ``mvn -Dspark.version=2.0.0 package``. (To continue working with Spark 1.x, update ``pom.xml`` by modifying properties such as ``spark.version``, ``scala.version``, and ``scala.binary.version``. You also need to change the implementation by replacing ``SparkSession`` with ``SQLContext`` and the type of the API parameters from ``Dataset[_]`` to ``DataFrame``.)
Enabling OpenMP for Mac OS
--------------------------
If you are on Mac OS and using a compiler that supports OpenMP, you need to go to the file ``xgboost/jvm-packages/create_jni.py`` and comment out the line
.. code-block:: python
CONFIG["USE_OPENMP"] = "OFF"
in order to get the benefit of multi-threading.
********
Contents
********
.. toctree::
Java Overview Tutorial <java_intro>
Code Examples <https://github.com/dmlc/xgboost/tree/master/jvm-packages/xgboost4j-example>
XGBoost4J Java API <http://dmlc.ml/docs/javadocs/index.html>
XGBoost4J Scala API <http://dmlc.ml/docs/scaladocs/xgboost4j/index.html>
XGBoost4J-Spark Scala API <http://dmlc.ml/docs/scaladocs/xgboost4j-spark/index.html>
XGBoost4J-Flink Scala API <http://dmlc.ml/docs/scaladocs/xgboost4j-flink/index.html>


@@ -1,143 +0,0 @@
XGBoost4J Java API
==================
This tutorial introduces the Java API for XGBoost.
## Data Interface
Like the XGBoost Python module, XGBoost4J uses ```DMatrix``` to handle data.
It supports libsvm text format files, sparse matrices in CSR/CSC format,
and dense matrices.
* To import ```DMatrix``` :
```java
import org.dmlc.xgboost4j.DMatrix;
```
* To load a libsvm text format file:
```java
DMatrix dmat = new DMatrix("train.svm.txt");
```
* Loading a sparse matrix in CSR/CSC format is a little more involved. Suppose we have the sparse matrix:
1 0 2 0
4 0 0 3
3 1 2 0
for CSR format
```java
long[] rowHeaders = new long[] {0,2,4,7};
float[] data = new float[] {1f,2f,4f,3f,3f,1f,2f};
int[] colIndex = new int[] {0,2,0,3,0,1,2};
DMatrix dmat = new DMatrix(rowHeaders, colIndex, data, DMatrix.SparseType.CSR);
```
for CSC format
```java
long[] colHeaders = new long[] {0,3,4,6,7};
float[] data = new float[] {1f,4f,3f,1f,2f,2f,3f};
int[] rowIndex = new int[] {0,1,2,2,0,2,1};
DMatrix dmat = new DMatrix(colHeaders, rowIndex, data, DMatrix.SparseType.CSC);
```
* To load a dense matrix, pass its values in row-major order. Suppose we have the 3*2 matrix:
1 2
3 4
5 6
```java
float[] data = new float[] {1f,2f,3f,4f,5f,6f};
int nrow = 3;
int ncol = 2;
float missing = 0.0f;
DMatrix dmat = new DMatrix(data, nrow, ncol, missing);
```
* To set weight :
```java
float[] weights = new float[] {1f,2f,1f};
dmat.setWeight(weights);
```
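The `rowHeaders`, `colIndex` and `data` arrays in the CSR example above can be derived mechanically from the dense matrix. The following is a minimal sketch using only the Java standard library; `CsrSketch` and `fromDense` are hypothetical names introduced for illustration and are not part of XGBoost4J.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of XGBoost4J): derives CSR arrays from a dense matrix. */
final class CsrSketch {
    final long[] rowHeaders; // rowHeaders[r]..rowHeaders[r+1] delimit row r inside data/colIndex
    final int[] colIndex;    // column of each stored (non-zero) value
    final float[] data;      // stored (non-zero) values, row by row

    CsrSketch(long[] rowHeaders, int[] colIndex, float[] data) {
        this.rowHeaders = rowHeaders;
        this.colIndex = colIndex;
        this.data = data;
    }

    /** Scans the matrix row by row and records every non-zero entry. */
    static CsrSketch fromDense(float[][] dense) {
        List<Integer> cols = new ArrayList<Integer>();
        List<Float> vals = new ArrayList<Float>();
        long[] headers = new long[dense.length + 1];
        for (int r = 0; r < dense.length; r++) {
            for (int c = 0; c < dense[r].length; c++) {
                if (dense[r][c] != 0f) {
                    cols.add(c);
                    vals.add(dense[r][c]);
                }
            }
            // Number of values stored so far marks where the next row starts.
            headers[r + 1] = vals.size();
        }
        int[] colIndex = new int[cols.size()];
        float[] data = new float[vals.size()];
        for (int i = 0; i < data.length; i++) {
            colIndex[i] = cols.get(i);
            data[i] = vals.get(i);
        }
        return new CsrSketch(headers, colIndex, data);
    }

    public static void main(String[] args) {
        // The sparse matrix used in the example above.
        float[][] dense = {
            {1f, 0f, 2f, 0f},
            {4f, 0f, 0f, 3f},
            {3f, 1f, 2f, 0f}
        };
        CsrSketch csr = fromDense(dense);
        System.out.println(Arrays.toString(csr.rowHeaders)); // [0, 2, 4, 7]
        System.out.println(Arrays.toString(csr.colIndex));   // [0, 2, 0, 3, 0, 1, 2]
        System.out.println(Arrays.toString(csr.data));       // [1.0, 2.0, 4.0, 3.0, 3.0, 1.0, 2.0]
    }
}
```

The three resulting arrays match the ones passed to the `DMatrix` constructor with `DMatrix.SparseType.CSR` in the example above.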
## Setting Parameters
* In XGBoost4J, any ```Iterable<Entry<String, Object>>``` object can be used as parameters.
* When each parameter has a single value, you can simply use the `entrySet` of a `Map`:
```java
Map<String, Object> paramMap = new HashMap<String, Object>() {
{
put("eta", 1.0);
put("max_depth", 2);
put("silent", 1);
put("objective", "binary:logistic");
put("eval_metric", "logloss");
}
};
Iterable<Entry<String, Object>> params = paramMap.entrySet();
```
* When the same parameter key needs multiple values, a ```List<Entry<String, Object>>``` is a good choice, e.g.:
```java
List<Entry<String, Object>> params = new ArrayList<Entry<String, Object>>() {
{
add(new SimpleEntry<String, Object>("eta", 1.0));
add(new SimpleEntry<String, Object>("max_depth", 2.0));
add(new SimpleEntry<String, Object>("silent", 1));
add(new SimpleEntry<String, Object>("objective", "binary:logistic"));
}
};
```
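To see why a list is useful when a key is repeated, the sketch below builds both kinds of parameter collections with the standard library only. `ParamsSketch` and its methods are hypothetical helpers, and using `eval_metric` with the values `logloss` and `error` as the repeated key is an assumption made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

/** Hypothetical helper (not part of XGBoost4J): contrasts Map-based and List-based parameters. */
final class ParamsSketch {

    /** A Map keeps one value per key, so the second eval_metric replaces the first. */
    static Iterable<Entry<String, Object>> uniqueKeyParams() {
        Map<String, Object> paramMap = new LinkedHashMap<String, Object>();
        paramMap.put("eta", 1.0);
        paramMap.put("max_depth", 2);
        paramMap.put("objective", "binary:logistic");
        paramMap.put("eval_metric", "logloss");
        paramMap.put("eval_metric", "error"); // overwrites "logloss"
        return paramMap.entrySet();
    }

    /** A List keeps every entry, so the same key can appear more than once. */
    static List<Entry<String, Object>> repeatedKeyParams() {
        List<Entry<String, Object>> params = new ArrayList<Entry<String, Object>>();
        params.add(new SimpleEntry<String, Object>("eta", 1.0));
        params.add(new SimpleEntry<String, Object>("max_depth", 2));
        params.add(new SimpleEntry<String, Object>("objective", "binary:logistic"));
        params.add(new SimpleEntry<String, Object>("eval_metric", "logloss"));
        params.add(new SimpleEntry<String, Object>("eval_metric", "error"));
        return params;
    }

    /** Counts how many entries carry the given key. */
    static int count(Iterable<Entry<String, Object>> params, String key) {
        int n = 0;
        for (Entry<String, Object> e : params) {
            if (e.getKey().equals(key)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count(uniqueKeyParams(), "eval_metric"));   // 1
        System.out.println(count(repeatedKeyParams(), "eval_metric")); // 2
    }
}
```

Both return types satisfy `Iterable<Entry<String, Object>>`, so either could be handed to the training call; only the list preserves both metric entries.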
## Training Model
With parameters and data, you are able to train a booster model.
* Import ```Trainer``` and ```Booster``` :
```java
import org.dmlc.xgboost4j.Booster;
import org.dmlc.xgboost4j.util.Trainer;
```
* Training
```java
DMatrix trainMat = new DMatrix("train.svm.txt");
DMatrix validMat = new DMatrix("valid.svm.txt");
//specify a watchList to see the performance
//any Iterable<Entry<String, DMatrix>> object could be used as watchList
List<Entry<String, DMatrix>> watchs = new ArrayList<>();
watchs.add(new SimpleEntry<>("train", trainMat));
watchs.add(new SimpleEntry<>("test", validMat));
int round = 2;
Booster booster = Trainer.train(params, trainMat, round, watchs, null, null);
```
* Saving model
After training, you can save the model and dump it out.
```java
booster.saveModel("model.bin");
```
* Dump Model and Feature Map
```java
booster.dumpModel("modelInfo.txt", false);
//dump with featureMap
booster.dumpModel("modelInfo.txt", "featureMap.txt", false);
```
* Load a model
```java
Params param = new Params() {
{
put("silent", 1);
put("nthread", 6);
}
};
Booster booster = new Booster(param, "model.bin");
```
## Prediction
After training or loading a model, you can use it to make predictions on other data. The result is a two-dimensional float array of shape (nsample, nclass); for leaf prediction, the shape is (nsample, nclass*ntrees).
```java
DMatrix dtest = new DMatrix("test.svm.txt");
//predict
float[][] predicts = booster.predict(dtest);
//predict leaf
float[][] leafPredicts = booster.predict(dtest, 0, true);
```

177
doc/jvm/java_intro.rst Normal file

@@ -0,0 +1,177 @@
##################
XGBoost4J Java API
##################
This tutorial introduces the Java API for XGBoost.
**************
Data Interface
**************
Like the XGBoost Python module, XGBoost4J uses ``DMatrix`` to handle data.
It supports libsvm text format files, sparse matrices in CSR/CSC format,
and dense matrices.
* The first step is to import ``DMatrix``:
.. code-block:: java
import org.dmlc.xgboost4j.DMatrix;
* Use ``DMatrix`` constructor to load data from a libsvm text format file:
.. code-block:: java
DMatrix dmat = new DMatrix("train.svm.txt");
* Pass arrays to ``DMatrix`` constructor to load from sparse matrix.
Suppose we have a sparse matrix
.. code-block:: none
1 0 2 0
4 0 0 3
3 1 2 0
We can express the sparse matrix in `Compressed Sparse Row (CSR) <https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_(CSR,_CRS_or_Yale_format)>`_ format:
.. code-block:: java
long[] rowHeaders = new long[] {0,2,4,7};
float[] data = new float[] {1f,2f,4f,3f,3f,1f,2f};
int[] colIndex = new int[] {0,2,0,3,0,1,2};
DMatrix dmat = new DMatrix(rowHeaders, colIndex, data, DMatrix.SparseType.CSR);
... or in `Compressed Sparse Column (CSC) <https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_column_(CSC_or_CCS)>`_ format:
.. code-block:: java
long[] colHeaders = new long[] {0,3,4,6,7};
float[] data = new float[] {1f,4f,3f,1f,2f,2f,3f};
int[] rowIndex = new int[] {0,1,2,2,0,2,1};
DMatrix dmat = new DMatrix(colHeaders, rowIndex, data, DMatrix.SparseType.CSC);
* You may also load your data from a dense matrix. Let's assume we have a matrix of form
.. code-block:: none
1 2
3 4
5 6
Using `row-major layout <https://en.wikipedia.org/wiki/Row-_and_column-major_order>`_, we specify the dense matrix as follows:
.. code-block:: java
float[] data = new float[] {1f,2f,3f,4f,5f,6f};
int nrow = 3;
int ncol = 2;
float missing = 0.0f;
DMatrix dmat = new DMatrix(data, nrow, ncol, missing);
* To set weight:
.. code-block:: java
float[] weights = new float[] {1f,2f,1f};
dmat.setWeight(weights);
******************
Setting Parameters
******************
* In XGBoost4J, any ``Iterable<Entry<String, Object>>`` object can be used as parameters.
* When each parameter has a single value, you can simply use the ``entrySet`` of a ``Map``:
.. code-block:: java
Map<String, Object> paramMap = new HashMap<String, Object>() {
{
put("eta", 1.0);
put("max_depth", 2);
put("silent", 1);
put("objective", "binary:logistic");
put("eval_metric", "logloss");
}
};
Iterable<Entry<String, Object>> params = paramMap.entrySet();
* When the same parameter key needs multiple values, a ``List<Entry<String, Object>>`` is a good choice, e.g.:
.. code-block:: java
List<Entry<String, Object>> params = new ArrayList<Entry<String, Object>>() {
{
add(new SimpleEntry<String, Object>("eta", 1.0));
add(new SimpleEntry<String, Object>("max_depth", 2.0));
add(new SimpleEntry<String, Object>("silent", 1));
add(new SimpleEntry<String, Object>("objective", "binary:logistic"));
}
};
**************
Training Model
**************
With parameters and data, you are able to train a booster model.
* Import ``Trainer`` and ``Booster``:
.. code-block:: java
import org.dmlc.xgboost4j.Booster;
import org.dmlc.xgboost4j.util.Trainer;
* Training
.. code-block:: java
DMatrix trainMat = new DMatrix("train.svm.txt");
DMatrix validMat = new DMatrix("valid.svm.txt");
//specify a watchList to see the performance
//any Iterable<Entry<String, DMatrix>> object could be used as watchList
List<Entry<String, DMatrix>> watchs = new ArrayList<>();
watchs.add(new SimpleEntry<>("train", trainMat));
watchs.add(new SimpleEntry<>("test", validMat));
int round = 2;
Booster booster = Trainer.train(params, trainMat, round, watchs, null, null);
* Saving model
After training, you can save the model and dump it out.
.. code-block:: java
booster.saveModel("model.bin");
* Dump Model and Feature Map
.. code-block:: java
booster.dumpModel("modelInfo.txt", false);
//dump with featureMap
booster.dumpModel("modelInfo.txt", "featureMap.txt", false);
* Load a model
.. code-block:: java
Params param = new Params() {
{
put("silent", 1);
put("nthread", 6);
}
};
Booster booster = new Booster(param, "model.bin");
**********
Prediction
**********
After training or loading a model, you can use it to make predictions on other data. The result is a two-dimensional float array of shape ``(nsample, nclass)``; for ``predictLeaf()``, the result has shape ``(nsample, nclass*ntrees)``.
.. code-block:: java
DMatrix dtest = new DMatrix("test.svm.txt");
//predict
float[][] predicts = booster.predict(dtest);
//predict leaf
float[][] leafPredicts = booster.predict(dtest, 0, true);


@@ -1,187 +0,0 @@
---
layout: post
title: XGBoost4J: Portable Distributed Tree Boosting in DataFlow
date: 2016-03-15 12:00:00
author: Nan Zhu, Tianqi Chen
comments: true
---
## Introduction
[XGBoost](https://github.com/dmlc/xgboost) is a library designed and optimized for tree boosting. The gradient boosting trees model was originally proposed by Friedman et al. By embracing multi-threading and introducing regularization, XGBoost delivers higher computational power and more accurate predictions. **More than half of the winning solutions in machine learning challenges** hosted at Kaggle adopt XGBoost ([Incomplete list](https://github.com/dmlc/xgboost/tree/master/demo#machine-learning-challenge-winning-solutions)).
XGBoost has provided native interfaces for C++, R, python, Julia and Java users.
It is used by both [data exploration and production scenarios](https://github.com/dmlc/xgboost/tree/master/demo#usecases) to solve real world machine learning problems.
The distributed XGBoost is described in the [recently published paper](http://arxiv.org/abs/1603.02754).
In short, the XGBoost system runs orders of magnitude faster than existing alternatives for distributed ML,
and uses far fewer resources. Readers are welcome to refer to the paper for more details.
Despite the current great success, one of our ultimate goals is to make XGBoost even more available for all production scenarios.
Programming languages and data processing/storage systems based on the Java Virtual Machine (JVM) play a significant role in the Big Data ecosystem. [Hadoop](http://hadoop.apache.org/), [Spark](http://spark.apache.org/) and the more recently introduced [Flink](http://flink.apache.org/) are very useful solutions for general large-scale data processing.
On the other side, the emerging demands of machine learning and deep learning
have inspired many excellent machine learning libraries.
Many of these machine learning libraries (e.g. [XGBoost](https://github.com/dmlc/xgboost)/[MxNet](https://github.com/dmlc/mxnet))
require new computation abstractions and native support (e.g. C++ for GPU computing).
They are also often [much more efficient](http://arxiv.org/abs/1603.02754).
The gap between the implementation fundamentals of the general data processing frameworks and the more specific machine learning libraries/systems prevents a smooth connection between these two types of systems, and thus brings unnecessary inconvenience to the end user. The common workflow is to use systems like Spark/Flink to preprocess/clean data, pass the results to machine learning systems like [XGBoost](https://github.com/dmlc/xgboost)/[MxNet](https://github.com/dmlc/mxnet) via the file system, and then conduct the subsequent machine learning phase. Jumping across two types of systems in this way is inconvenient for users and brings additional overhead to the operators of the infrastructure.
We want the best of both worlds, so that we can use data processing frameworks like Spark and Flink together with
the best distributed machine learning solutions.
To resolve the situation, we introduce the newly brewed [XGBoost4J](https://github.com/dmlc/xgboost/tree/master/jvm-packages),
<b>XGBoost</b> for the <b>J</b>VM Platform. We aim to provide clean Java/Scala APIs and integration with the most popular data processing systems developed in JVM-based languages.
## Unix Philosophy in Machine Learning
XGBoost and XGBoost4J adopt the Unix Philosophy.
XGBoost **does its best in one thing -- tree boosting** and is **being designed to work with other systems**.
We strongly believe that a machine learning solution should not be restricted to a certain language or a certain platform.
Specifically, users will be able to use distributed XGBoost in both Spark and Flink, and possibly more frameworks in the future.
We have designed the API in a portable way so it **can be easily ported to other Dataflow frameworks provided by the Cloud**.
XGBoost4J shares its core with other XGBoost libraries, which means data scientists can use R/Python
to read and visualize a model trained in a distributed fashion.
It also means that users can start with the single-machine version for exploration,
which can already handle hundreds of millions of examples.
## System Overview
In the following figure, we describe the overall architecture of XGBoost4J. XGBoost4J provides the Java/Scala API calling the core functionality of the XGBoost library. Most importantly, it not only supports single-machine model training, but also provides an abstraction layer which masks the differences between the underlying data processing engines and scales training to distributed servers.
![XGBoost4J Architecture](https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/xgboost4j.png)
By calling the XGBoost4J API, users can scale the model training to the cluster. XGBoost4J runs instances of the XGBoost worker inside Spark/Flink tasks across the cluster. The communication among the distributed model training tasks and the XGBoost4J runtime environment goes through [Rabit](https://github.com/dmlc/rabit).
With the abstraction of XGBoost4J, users can build a unified data analytics application covering Extract-Transform-Load, data exploration, machine learning model training and the final data product service. The following figure illustrates an example application built on top of Apache Spark. The application seamlessly embeds XGBoost into the processing pipeline and exchanges data with other Spark-based processing phases through Spark's distributed memory layer.
![XGBoost4J Architecture](https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/unified_pipeline.png)
## Single-machine Training Walk-through
In this section, we will walk through the APIs of XGBoost4J by example.
We will be using Scala for demonstration, but we also have a complete API for Java users.
To start the model training and evaluation, we need to prepare the training and test sets:
```scala
val trainMax = new DMatrix("../../demo/data/agaricus.txt.train")
val testMax = new DMatrix("../../demo/data/agaricus.txt.test")
```
After preparing the data, we can train our model:
```scala
val params = new mutable.HashMap[String, Any]()
params += "eta" -> 1.0
params += "max_depth" -> 2
params += "silent" -> 1
params += "objective" -> "binary:logistic"
val watches = new mutable.HashMap[String, DMatrix]
watches += "train" -> trainMax
watches += "test" -> testMax
val round = 2
// train a model
val booster = XGBoost.train(trainMax, params.toMap, round, watches.toMap)
```
We then evaluate our model:
```scala
val predicts = booster.predict(testMax)
```
`predict` outputs the prediction results, and you can define a customized evaluation method to derive your own metrics (see the examples in [Customized Evaluation Metric in Java](https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-example/src/main/java/ml/dmlc/xgboost4j/java/example/CustomObjective.java) and [Customized Evaluation Metric in Scala](https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-example/src/main/scala/ml/dmlc/xgboost4j/scala/example/CustomObjective.scala)).
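As a rough illustration of a customized evaluation, the sketch below computes a classification error rate from a `float[][]` prediction array shaped like the one the Java API's `predict` returns. It assumes a `binary:logistic` objective, where each row holds a single probability, and a 0.5 decision threshold; `ErrorRateSketch`, the prediction values and the labels are all hypothetical and only use the Java standard library.

```java
/** Hypothetical helper (not part of XGBoost4J): error rate for binary predictions. */
final class ErrorRateSketch {

    /**
     * Fraction of rows whose thresholded prediction differs from the label.
     * Assumes predicts[i][0] is the probability of the positive class.
     */
    static float errorRate(float[][] predicts, float[] labels) {
        if (labels.length == 0 || predicts.length != labels.length) {
            throw new IllegalArgumentException("predictions and labels must be non-empty and equally long");
        }
        int wrong = 0;
        for (int i = 0; i < labels.length; i++) {
            // Assumed decision rule: probability above 0.5 means class 1.
            float predicted = predicts[i][0] > 0.5f ? 1f : 0f;
            if (predicted != labels[i]) {
                wrong++;
            }
        }
        return (float) wrong / labels.length;
    }

    public static void main(String[] args) {
        // Made-up values standing in for the output of predict and the true labels.
        float[][] predicts = {{0.9f}, {0.2f}, {0.6f}, {0.4f}};
        float[] labels = {1f, 0f, 0f, 0f};
        System.out.println(errorRate(predicts, labels)); // prints 0.25
    }
}
```

Only the third row is misclassified here, so one of four predictions is wrong.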
## Distributed Model Training with Distributed Dataflow Frameworks
The most exciting part of this XGBoost4J release is the integration with Distributed Dataflow Frameworks. The most popular data processing frameworks fall into this category, e.g. [Apache Spark](http://spark.apache.org/), [Apache Flink](http://flink.apache.org/), etc. In this part, we will walk through the steps to build unified data analytics applications containing data preprocessing and distributed model training with Spark and Flink. (Currently, we only provide a Scala API for the integration with Spark and Flink.)
Similar to the single-machine training, we need to prepare the training and test dataset.
### Spark Example
In Spark, the dataset is represented as the [Resilient Distributed Dataset (RDD)](http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds). We can use the tools distributed with Spark to parse a libSVM file and wrap it as an RDD:
```scala
val trainRDD = MLUtils.loadLibSVMFile(sc, inputTrainPath).repartition(args(1).toInt)
```
We move forward to train the models:
```scala
val xgboostModel = XGBoost.train(trainRDD, paramMap, numRound, numWorkers)
```
The next step is to evaluate the model. You can predict either locally or in a distributed fashion:
```scala
// testSet is an RDD containing testset data represented as
// org.apache.spark.mllib.regression.LabeledPoint
val testSet = MLUtils.loadLibSVMFile(sc, inputTestPath)
// local prediction
// import methods in DataUtils to automatically convert Iterator[org.apache.spark.mllib.regression.LabeledPoint]
// to Iterator[ml.dmlc.xgboost4j.LabeledPoint]
import DataUtils._
xgboostModel.predict(new DMatrix(testSet.collect().iterator))
// distributed prediction
xgboostModel.predict(testSet)
```
### Flink example
In Flink, we represent training data as Flink's [DataSet](https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html)
```scala
val trainData = MLUtils.readLibSVM(env, "/path/to/data/agaricus.txt.train")
```
Model Training can be done as follows
```scala
val xgboostModel = XGBoost.train(trainData, paramMap, round)
```
Prediction can be done either locally or in a distributed fashion:
```scala
// testData is a Dataset containing testset data represented as
// org.apache.flink.ml.math.Vector.LabeledVector
val testData = MLUtils.readLibSVM(env, "/path/to/data/agaricus.txt.test")
// local prediction
xgboostModel.predict(testData.collect().iterator)
// distributed prediction
xgboostModel.predict(testData.map{x => x.vector})
```
## Road Map
This is the first release of the XGBoost4J package, and we are actively working on more features for the next release. You can follow our progress in [XGBoost4J Road Map](https://github.com/dmlc/xgboost/issues/935).
While we are trying our best to keep API changes to a minimum, the APIs are still subject to incompatible changes.
## Further Readings
If you are interested in knowing more about XGBoost, you can find rich resources in
- [The github repository of XGBoost](https://github.com/dmlc/xgboost)
- [The comprehensive documentation site for XGBoost](http://xgboost.readthedocs.org/en/latest/index.html)
- [An introduction to the gradient boosting model](http://xgboost.readthedocs.org/en/latest/model.html)
- [Tutorials for the R package](http://xgboost.readthedocs.org/en/latest/R-package/index.html)
- [Introduction of the Parameters](http://xgboost.readthedocs.org/en/latest/parameter.html)
- [Awesome XGBoost, a curated list of examples, tutorials, blogs about XGBoost usecases](https://github.com/dmlc/xgboost/tree/master/demo)
## Acknowledgements
We would like to send many thanks to [Zixuan Huang](https://github.com/yanqingmen), the early developer of XGBoost for Java.


@@ -1,139 +0,0 @@
## Introduction
In March 2016, we released the first version of [XGBoost4J](http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html), a set of packages providing Java/Scala interfaces to XGBoost and integration with prevalent JVM-based distributed data processing platforms such as Spark/Flink.
The integrations with Spark/Flink, a.k.a. <b>XGBoost4J-Spark</b> and <b>XGBoost4J-Flink</b>, have received tremendous positive feedback from the community. They enable users to build a unified pipeline, embedding XGBoost into a data processing system based on widely deployed frameworks like Spark. The following figure shows the general architecture of such a pipeline with the first version of <b>XGBoost4J-Spark</b>, where the data processing is based on the low-level [Resilient Distributed Dataset (RDD)](http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds) abstraction.
![XGBoost4J Architecture](https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/unified_pipeline.png)
In the last months, we have had a lot of communication with users and gained a deeper understanding of their latest usage scenarios and requirements:
* XGBoost keeps gaining more and more deployments in production environments and more adoption in machine learning competitions [Link](http://datascience.la/xgboost-workshop-and-meetup-talk-with-tianqi-chen/).
* While Spark is still the mainstream data processing tool in most scenarios, more and more users are porting their RDD-based Spark programs to [DataFrame/Dataset APIs](http://spark.apache.org/docs/latest/sql-programming-guide.html) for the well-designed interfaces to manipulate structured data and the [significant performance improvement](https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html).
* Spark itself has presented a clear roadmap in which DataFrame/Dataset is the base of the latest and future features, e.g. the latest version of [ML pipeline](http://spark.apache.org/docs/latest/ml-guide.html) and [Structured Streaming](http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html).
Based on this feedback from users, we observed a gap between the original RDD-based XGBoost4J-Spark and the users' latest usage scenarios as well as the future direction of the Spark ecosystem. To fill this gap, we started working on the <b><i>integration of XGBoost and Spark's DataFrame/Dataset abstraction</i></b> in September. In this blog, we will introduce <b>the latest version of XGBoost4J-Spark</b>, which allows the user to work with DataFrame/Dataset directly and embed XGBoost into Spark's ML pipeline seamlessly.
## A Full Integration of XGBoost and DataFrame/Dataset
The following figure illustrates the new pipeline architecture with the latest XGBoost4J-Spark.
![XGBoost4J New Architecture](https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/unified_pipeline_new.png)
Unlike the previous version, the latest one lets users work with both the low- and high-level memory abstractions in Spark, i.e. RDD and DataFrame/Dataset. The DataFrame/Dataset abstraction allows the user to manipulate structured datasets and to use Spark's built-in routines or User Defined Functions (UDF) to explore the value distribution in columns before feeding the data into the machine learning phase of the pipeline. In the following example, structured sales records saved in a JSON file are parsed into a DataFrame through Spark's API and fed to XGBoost for training in two lines of Scala code.
```scala
// load sales records saved in json files
val salesDF = spark.read.json("sales.json")
// call XGBoost API to train with the DataFrame-represented training set
val xgboostModel = XGBoost.trainWithDataFrame(
salesDF, paramMap, numRound, nWorkers, useExternalMemory)
```
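The call above takes several arguments that the snippet does not define. As a minimal sketch, they could look like the following; the concrete values are illustrative assumptions, not recommendations, and only `eta` and `max_depth` are parameter names mentioned elsewhere in this post.

```scala
// Hypothetical values for the arguments used in the call above.
// The keys are XGBoost parameter names; the values are illustrative only.
val paramMap: Map[String, Any] = Map(
  "eta" -> 0.1,               // learning rate
  "max_depth" -> 6,           // maximum depth of each tree
  "objective" -> "reg:linear" // regression objective (assumed for sales data)
)
val numRound = 100            // number of boosting rounds
val nWorkers = 4              // number of distributed XGBoost workers
val useExternalMemory = false // keep the training data in memory
```

With these definitions in scope, the `trainWithDataFrame` call reads as "train for 100 rounds on 4 workers without spilling to external memory".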
By integrating with DataFrame/Dataset, XGBoost4J-Spark not only enables users to call DataFrame/Dataset APIs directly but also makes DataFrame/Dataset-based Spark features, e.g. the ML package, available to XGBoost users.
### Integration with ML Package
The ML package of Spark provides a set of convenient tools for feature extraction/transformation/selection. Additionally, with the model selection tool in the ML package, users can select the best model through an automatic parameter search that is defined through the ML package APIs. With the DataFrame/Dataset integration, these features of the ML package are also available to XGBoost users.
#### Feature Extraction/Transformation/Selection
The following example shows a feature transformer which converts the string-typed storeType feature to the numeric storeTypeIndex. The transformed DataFrame is then used to train an XGBoost model.
```scala
import org.apache.spark.ml.feature.StringIndexer
// load sales records saved in json files
val salesDF = spark.read.json("sales.json")
// transform the string-represented storeType feature to numeric storeTypeIndex
val indexer = new StringIndexer()
.setInputCol("storeType")
.setOutputCol("storeTypeIndex")
// drop the extra column
val indexed = indexer.fit(salesDF).transform(salesDF).drop("storeType")
// use the transformed dataframe as training dataset
val xgboostModel = XGBoost.trainWithDataFrame(
indexed, paramMap, numRound, nWorkers, useExternalMemory)
```
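To make the transformation concrete, here is a plain-Scala sketch (no Spark required) of the mapping that `StringIndexer` learns by default: labels are indexed by descending frequency, so the most frequent label gets index 0.0. The store type values are made up for illustration.

```scala
// Plain-Scala sketch of the mapping StringIndexer learns by default:
// the most frequent label gets index 0.0, the next 1.0, and so on.
// The store types below are made-up sample values.
val storeTypes = Seq("mall", "street", "mall", "outlet", "mall", "street")

val storeTypeIndex: Map[String, Double] =
  storeTypes
    .groupBy(identity)                                     // label -> occurrences
    .toSeq
    .sortBy { case (_, occurrences) => -occurrences.size } // most frequent first
    .map { case (label, _) => label }
    .zipWithIndex
    .map { case (label, i) => label -> i.toDouble }
    .toMap

// The numeric column that would replace the string column.
val indexedColumn: Seq[Double] = storeTypes.map(storeTypeIndex)
// indexedColumn == Seq(0.0, 1.0, 0.0, 2.0, 0.0, 1.0)
```

The real transformer does the same thing in two phases: `fit` computes the label-to-index mapping from the data, and `transform` applies it to a DataFrame.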
#### Pipelining
The Spark ML package allows the user to build a complete pipeline from feature extraction/transformation/selection to model training. We integrate XGBoost with the ML package so that XGBoost can be embedded into such a pipeline seamlessly. The following example shows how to build a pipeline consisting of feature transformers and the XGBoost estimator.
```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
// load sales records saved in json files
val salesDF = spark.read.json("sales.json")
// transform the string-represented storeType feature to numeric storeTypeIndex
val indexer = new StringIndexer()
.setInputCol("storeType")
.setOutputCol("storeTypeIndex")
// assemble the columns in dataframe into a vector
val vectorAssembler = new VectorAssembler()
.setInputCols(Array("storeId", "storeTypeIndex", ...))
.setOutputCol("features")
// construct the pipeline
val pipeline = new Pipeline().setStages(
  Array(indexer, ..., vectorAssembler, new XGBoostEstimator(Map[String, Any]("num_rounds" -> 100))))
// fit the whole pipeline on the training dataset
val xgboostModel = pipeline.fit(salesDF)
// predict with the trained model
val salesTestDF = spark.read.json("sales_test.json")
val salesRecordsWithPred = xgboostModel.transform(salesTestDF)
```
#### Model Selection
The most critical step in getting the most out of XGBoost is selecting the optimal parameters for the model. Tuning parameters manually is a tedious and labor-intensive process. With the latest version of XGBoost4J-Spark, we can use Spark's model selection tool to automate this process. The following code snippet uses [TrainValidationSplit](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.tuning.TrainValidationSplit) and [RegressionEvaluator](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.RegressionEvaluator) to search for the optimal combination of two XGBoost parameters, [max_depth and eta](https://github.com/dmlc/xgboost/blob/master/doc/parameter.md). The model producing the minimum value of the cost function defined by RegressionEvaluator is selected and used to generate predictions for the test set.
```scala
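import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
// create XGBoostEstimator
val xgbEstimator = new XGBoostEstimator(xgboostParam).setFeaturesCol("features").
  setLabelCol("sales")
// candidate values for the two parameters to search over
val paramGrid = new ParamGridBuilder()
  .addGrid(xgbEstimator.maxDepth, Array(5, 6))
  .addGrid(xgbEstimator.eta, Array(0.1, 0.4))
  .build()
// hold out 20% of the training data to validate each candidate
val tv = new TrainValidationSplit()
  .setEstimator(xgbEstimator)
  .setEvaluator(new RegressionEvaluator().setLabelCol("sales"))
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.8)
// fit every candidate and keep the best model; salesDF is assumed to be the
// training DataFrame with "features" and "sales" columns
val bestModel = tv.fit(salesDF)
// predict with the selected model
val salesTestDF = spark.read.json("sales_test.json")
val salesRecordsWithPred = bestModel.transform(salesTestDF)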
```
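The parameter grid in the example above is simply the cross product of the candidate values. As a plain-Scala sketch (no Spark required, using string keys in place of Spark's `Param` objects), the search space can be enumerated like this:

```scala
// Plain-Scala sketch of the grid that ParamGridBuilder expands to:
// every combination of the candidate values, one entry per model to train.
val maxDepths = Seq(5, 6)
val etas = Seq(0.1, 0.4)

val grid: Seq[Map[String, Any]] =
  for {
    depth <- maxDepths
    eta <- etas
  } yield Map[String, Any]("max_depth" -> depth, "eta" -> eta)

// grid.size == 4:
// (5, 0.1), (5, 0.4), (6, 0.1), (6, 0.4)
```

With a train ratio of 0.8, each of these four candidates is trained on 80% of the data and scored on the remaining 20%, so the cost of the search grows multiplicatively with every parameter added to the grid.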
## Summary
With the latest XGBoost4J-Spark, XGBoost users can build a more efficient data processing pipeline that uses the DataFrame/Dataset APIs to handle structured data with excellent performance, while using XGBoost to extract insights from the dataset and turn them into action. Additionally, XGBoost4J-Spark seamlessly connects XGBoost with the Spark ML package, which makes feature extraction/transformation/selection and parameter tuning much easier than before.
The latest version of XGBoost4J-Spark is available in the [GitHub Repository](https://github.com/dmlc/xgboost), and the latest API docs are available [here](http://xgboost.readthedocs.io/en/latest/jvm/index.html).
## Portable Machine Learning Systems
XGBoost is one of the projects incubated by the [Distributed Machine Learning Community (DMLC)](http://dmlc.ml/), which has also created several other popular projects on machine learning systems ([Link](https://github.com/dmlc/)), e.g. one of the most popular deep learning frameworks, [MXNet](http://mxnet.io/). We strongly believe that machine learning solutions should not be restricted to a certain language or platform. We realize this design philosophy in several projects, like XGBoost and MXNet. We welcome more contributions from the community in this direction.
## Further Reading
If you are interested in learning more about XGBoost, you can find rich resources in
- [The github repository of XGBoost](https://github.com/dmlc/xgboost)
- [The comprehensive documentation site for XGBoost](http://xgboost.readthedocs.org/en/latest/index.html)
- [An introduction to the gradient boosting model](http://xgboost.readthedocs.org/en/latest/model.html)
- [Tutorials for the R package](http://xgboost.readthedocs.org/en/latest/R-package/index.html)
- [Introduction of the Parameters](http://xgboost.readthedocs.org/en/latest/parameter.html)
- [Awesome XGBoost, a curated list of examples, tutorials, blogs about XGBoost use cases](https://github.com/dmlc/xgboost/tree/master/demo)