
##############################
Getting Started with XGBoost4J
##############################
This tutorial introduces the Java API for XGBoost.

**************
Data Interface
**************
Like the XGBoost Python module, XGBoost4J uses ``DMatrix`` to handle data.
LIBSVM text format files, sparse matrices in CSR/CSC format, and dense
matrices are supported.

* The first step is to import ``DMatrix``:

  .. code-block:: java

    import ml.dmlc.xgboost4j.java.DMatrix;

* Use the ``DMatrix`` constructor to load data from a LIBSVM text format file:

  .. code-block:: java

    DMatrix dmat = new DMatrix("train.svm.txt");

* Pass arrays to the ``DMatrix`` constructor to load from a sparse matrix.

  Suppose we have a sparse matrix

  .. code-block:: none

    1 0 2 0
    4 0 0 3
    3 1 2 0

  We can express the sparse matrix in `Compressed Sparse Row (CSR) <https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_(CSR,_CRS_or_Yale_format)>`_ format:

  .. code-block:: java

    long[] rowHeaders = new long[] {0,2,4,7};
    float[] data = new float[] {1f,2f,4f,3f,3f,1f,2f};
    int[] colIndex = new int[] {0,2,0,3,0,1,2};
    DMatrix dmat = new DMatrix(rowHeaders, colIndex, data, DMatrix.SparseType.CSR);

  ... or in `Compressed Sparse Column (CSC) <https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_column_(CSC_or_CCS)>`_ format:

  .. code-block:: java

    long[] colHeaders = new long[] {0,3,4,6,7};
    float[] data = new float[] {1f,4f,3f,1f,2f,2f,3f};
    int[] rowIndex = new int[] {0,1,2,2,0,2,1};
    DMatrix dmat = new DMatrix(colHeaders, rowIndex, data, DMatrix.SparseType.CSC);

* You may also load your data from a dense matrix. Let's assume we have a matrix of form

  .. code-block:: none

    1 2
    3 4
    5 6

  Using `row-major layout <https://en.wikipedia.org/wiki/Row-_and_column-major_order>`_, we specify the dense matrix as follows:

  .. code-block:: java

    float[] data = new float[] {1f,2f,3f,4f,5f,6f};
    int nrow = 3;
    int ncol = 2;
    float missing = 0.0f;
    DMatrix dmat = new DMatrix(data, nrow, ncol, missing);

* To set the weights of the instances:

  .. code-block:: java

    float[] weights = new float[] {1f,2f,1f};
    dmat.setWeight(weights);
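
When a ``DMatrix`` is built from in-memory arrays rather than loaded from a LIBSVM file, it carries no labels yet. As a minimal sketch (the label values below are made up for illustration), labels can be attached with ``setLabel`` before training:

.. code-block:: java

  // one label per row of the dense matrix above
  float[] labels = new float[] {1f, 0f, 1f};
  dmat.setLabel(labels);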

******************
Setting Parameters
******************
In XGBoost4J, parameters are specified as a ``Map``:

.. code-block:: java

  Map<String, Object> params = new HashMap<String, Object>() {
    {
      put("eta", 1.0);
      put("max_depth", 2);
      put("silent", 1);
      put("objective", "binary:logistic");
      put("eval_metric", "logloss");
    }
  };
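
Note that the snippet above uses double-brace initialization, which creates an anonymous ``HashMap`` subclass just to populate the map inline. A plain ``HashMap`` works equally well; the following is only a stylistic alternative with the same parameter values:

.. code-block:: java

  Map<String, Object> params = new HashMap<>();
  params.put("eta", 1.0);
  params.put("max_depth", 2);
  params.put("silent", 1);
  params.put("objective", "binary:logistic");
  params.put("eval_metric", "logloss");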

**************
Training Model
**************
With parameters and data, you are able to train a booster model.

* Import ``Booster`` and ``XGBoost``:

  .. code-block:: java

    import ml.dmlc.xgboost4j.java.Booster;
    import ml.dmlc.xgboost4j.java.XGBoost;

* Training:

  .. code-block:: java

    DMatrix trainMat = new DMatrix("train.svm.txt");
    DMatrix validMat = new DMatrix("valid.svm.txt");

    // Specify a watch list to monitor performance on the training and validation sets
    Map<String, DMatrix> watches = new HashMap<>();
    watches.put("train", trainMat);
    watches.put("valid", validMat);

    int nround = 2;
    Booster booster = XGBoost.train(trainMat, params, nround, watches, null, null);

* Saving model

  After training, you can save the model and dump it out.

  .. code-block:: java

    booster.saveModel("model.bin");

* Generating model dump with feature map (see the note after this list for the feature map file format)

  .. code-block:: java

    String[] model_dump = booster.getModelDump(null, false);
    // dump with feature map
    String[] model_dump_with_feature_map = booster.getModelDump("featureMap.txt", false);

* Load a model

  .. code-block:: java

    Booster booster = XGBoost.loadModel("model.bin");
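
The feature map file passed to ``getModelDump`` has to be written by you; it is not produced by XGBoost4J. Based on the feature map format used by XGBoost's dump utilities, each line describes one feature as ``<feature index> <feature name> <feature type>``, where the type is ``q`` for quantitative, ``i`` for indicator (binary) and ``int`` for integer features. A hypothetical ``featureMap.txt`` could look like:

.. code-block:: none

  0 age q
  1 is_student i
  2 num_visits int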

**********
Prediction
**********
After training and loading a model, you can use it to make predictions on other data. The result is a two-dimensional float array of shape ``(nsample, nclass)``; for ``predictLeaf()``, the result is of shape ``(nsample, nclass*ntrees)``.

.. code-block:: java

  DMatrix dtest = new DMatrix("test.svm.txt");
  // predict
  float[][] predicts = booster.predict(dtest);
  // predict leaf
  float[][] leafPredicts = booster.predictLeaf(dtest, 0);
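
Putting the pieces together, here is a minimal end-to-end sketch assembled only from the calls shown above. It is a sketch rather than a canonical example: the class name ``BasicWalkThrough`` and the file names are placeholders, and it assumes the XGBoost4J jar is on the classpath.

.. code-block:: java

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;

  import ml.dmlc.xgboost4j.java.Booster;
  import ml.dmlc.xgboost4j.java.DMatrix;
  import ml.dmlc.xgboost4j.java.XGBoost;
  import ml.dmlc.xgboost4j.java.XGBoostError;

  public class BasicWalkThrough {
    public static void main(String[] args) throws IOException, XGBoostError {
      // load training and test data from LIBSVM text files (placeholder paths)
      DMatrix trainMat = new DMatrix("train.svm.txt");
      DMatrix testMat = new DMatrix("test.svm.txt");

      // booster parameters
      Map<String, Object> params = new HashMap<>();
      params.put("eta", 1.0);
      params.put("max_depth", 2);
      params.put("objective", "binary:logistic");

      // watch performance on both sets during training
      Map<String, DMatrix> watches = new HashMap<>();
      watches.put("train", trainMat);
      watches.put("test", testMat);

      // train for two rounds, save the model, reload it and predict
      Booster booster = XGBoost.train(trainMat, params, 2, watches, null, null);
      booster.saveModel("model.bin");

      Booster loaded = XGBoost.loadModel("model.bin");
      float[][] predicts = loaded.predict(testMat);
      System.out.println("prediction for the first test row: " + predicts[0][0]);
    }
  }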