Backport doc fixes that are compatible with 0.72 release

* Clarify behavior of LIBSVM in XGBoost4J-Spark (#3524)
* Fix typo in faq.rst (#3521)
* Fix typo in parameter.rst, gblinear section (#3518)
* Clarify supported OSes for XGBoost4J published JARs (#3547)
* Update broken links (#3565)
* Grammar fixes and typos (#3568)
* Bring XGBoost4J Intro up-to-date (#3574)

This commit is contained in:
parent e19dded9a3
commit 4334b9cc91
@@ -7,7 +7,7 @@ This document contains frequently asked questions about XGBoost.
 **********************
 How to tune parameters
 **********************
-See :doc:`Parameter Tunning Guide </tutorials/param_tuning>`.
+See :doc:`Parameter Tuning Guide </tutorials/param_tuning>`.
 
 ************************
 Description on the model
@@ -56,6 +56,13 @@ For sbt, please add the repository and dependency in build.sbt as following:
 
   "ml.dmlc" % "xgboost4j" % "latest_source_version_num"
 
+If you want to use XGBoost4J-Spark, replace ``xgboost4j`` with ``xgboost4j-spark``.
+
+.. note:: Spark 2.0 Required
+
+  After integrating with Dataframe/Dataset APIs of Spark 2.0, XGBoost4J-Spark only supports compile with Spark 2.x. You can build XGBoost4J-Spark as a component of XGBoost4J by running ``mvn package``, and you can specify the version of spark with ``mvn -Dspark.version=2.0.0 package``. (To continue working with Spark 1.x, the users are supposed to update pom.xml by modifying the properties like ``spark.version``, ``scala.version``, and ``scala.binary.version``. Users also need to change the implementation by replacing ``SparkSession`` with ``SQLContext`` and the type of API parameters from ``Dataset[_]`` to ``Dataframe``)
+
 Installation from maven repo
 ============================
@@ -76,9 +83,11 @@ Access release version
 
   "ml.dmlc" % "xgboost4j" % "latest_version_num"
 
 This will checkout the latest stable version from the Maven Central.
 
 For the latest release version number, please check `here <https://github.com/dmlc/xgboost/releases>`_.
 
-if you want to use XGBoost4J-Spark, you just need to replace ``xgboost4j`` with ``xgboost4j-spark``.
+if you want to use XGBoost4J-Spark, replace ``xgboost4j`` with ``xgboost4j-spark``.
 
 Access SNAPSHOT version
 -----------------------
@@ -117,9 +126,9 @@ Then add dependency as following:
 
 For the latest release version number, please check `here <https://github.com/CodingCat/xgboost/tree/maven-repo/ml/dmlc/xgboost4j>`_.
 
 if you want to use XGBoost4J-Spark, you just need to replace ``xgboost4j`` with ``xgboost4j-spark``.
 
-After integrating with Dataframe/Dataset APIs of Spark 2.0, XGBoost4J-Spark only supports compile with Spark 2.x. You can build XGBoost4J-Spark as a component of XGBoost4J by running ``mvn package``, and you can specify the version of spark with ``mvn -Dspark.version=2.0.0 package``. (To continue working with Spark 1.x, the users are supposed to update pom.xml by modifying the properties like ``spark.version``, ``scala.version``, and ``scala.binary.version``. Users also need to change the implementation by replacing ``SparkSession`` with ``SQLContext`` and the type of API parameters from ``Dataset[_]`` to ``Dataframe``)
+.. note:: Windows not supported by published JARs
+
+  The published JARs from the Maven Central and GitHub currently only support Linux and MacOS. Windows users should consider building XGBoost4J / XGBoost4J-Spark from the source. Alternatively, check out pre-built JARs from `criteo-forks/xgboost-jars <https://github.com/criteo-forks/xgboost-jars>`_.
 
 Enabling OpenMP for Mac OS
 --------------------------
@@ -136,8 +145,9 @@ Contents
 ********
 
 .. toctree::
   :maxdepth: 2
 
-  Java Overview Tutorial <java_intro>
+  java_intro
+  Code Examples <https://github.com/dmlc/xgboost/tree/master/jvm-packages/xgboost4j-example>
   XGBoost4J Java API <http://dmlc.ml/docs/javadocs/index.html>
   XGBoost4J Scala API <http://dmlc.ml/docs/scaladocs/xgboost4j/index.html>
@@ -1,28 +1,28 @@
-##################
-XGBoost4J Java API
-##################
+##############################
+Getting Started with XGBoost4J
+##############################
 This tutorial introduces Java API for XGBoost.
 
 **************
 Data Interface
 **************
-Like the XGBoost python module, XGBoost4J uses ``DMatrix`` to handle data,
-libsvm txt format file, sparse matrix in CSR/CSC format, and dense matrix is
+Like the XGBoost python module, XGBoost4J uses DMatrix to handle data.
+LIBSVM txt format file, sparse matrix in CSR/CSC format, and dense matrix are
 supported.
 
-* The first step is to import ``DMatrix``:
+* The first step is to import DMatrix:
 
   .. code-block:: java
 
-    import org.dmlc.xgboost4j.DMatrix;
+    import ml.dmlc.xgboost4j.java.DMatrix;
 
-* Use ``DMatrix`` constructor to load data from a libsvm text format file:
+* Use DMatrix constructor to load data from a libsvm text format file:
 
   .. code-block:: java
 
     DMatrix dmat = new DMatrix("train.svm.txt");
 
-* Pass arrays to ``DMatrix`` constructor to load from sparse matrix.
+* Pass arrays to DMatrix constructor to load from sparse matrix.
 
   Suppose we have a sparse matrix
@@ -39,7 +39,8 @@ supported.
 
     long[] rowHeaders = new long[] {0,2,4,7};
     float[] data = new float[] {1f,2f,4f,3f,3f,1f,2f};
     int[] colIndex = new int[] {0,2,0,3,0,1,2};
-    DMatrix dmat = new DMatrix(rowHeaders, colIndex, data, DMatrix.SparseType.CSR);
+    int numColumn = 4;
+    DMatrix dmat = new DMatrix(rowHeaders, colIndex, data, DMatrix.SparseType.CSR, numColumn);
 
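To make the CSR encoding the new ``numColumn`` argument describes concrete, here is a minimal sketch (an editorial aside, not part of the commit) that walks the arrays above and prints each stored entry; ``rowHeaders[i]`` and ``rowHeaders[i + 1]`` bound row ``i``'s slice of ``colIndex`` and ``data``:

.. code-block:: java

   long[] rowHeaders = new long[] {0, 2, 4, 7};
   float[] data = new float[] {1f, 2f, 4f, 3f, 3f, 1f, 2f};
   int[] colIndex = new int[] {0, 2, 0, 3, 0, 1, 2};
   for (int row = 0; row < rowHeaders.length - 1; row++) {
     for (long k = rowHeaders[row]; k < rowHeaders[row + 1]; k++) {
       // prints (0,0)=1.0 (0,2)=2.0 (1,0)=4.0 (1,3)=3.0 (2,0)=3.0 (2,1)=1.0 (2,2)=2.0
       System.out.printf("(%d,%d)=%.1f%n", row, colIndex[(int) k], data[(int) k]);
     }
   }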
 ... or in `Compressed Sparse Column (CSC) <https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_column_(CSC_or_CCS)>`_ format:
 
@@ -48,7 +49,8 @@ supported.
 
     long[] colHeaders = new long[] {0,3,4,6,7};
     float[] data = new float[] {1f,4f,3f,1f,2f,2f,3f};
     int[] rowIndex = new int[] {0,1,2,2,0,2,1};
-    DMatrix dmat = new DMatrix(colHeaders, rowIndex, data, DMatrix.SparseType.CSC);
+    int numRow = 3;
+    DMatrix dmat = new DMatrix(colHeaders, rowIndex, data, DMatrix.SparseType.CSC, numRow);
 
 * You may also load your data from a dense matrix. Let's assume we have a matrix of form
@@ -66,7 +68,7 @@ supported.
 
     int nrow = 3;
     int ncol = 2;
     float missing = 0.0f;
-    DMatrix dmat = new Matrix(data, nrow, ncol, missing);
+    DMatrix dmat = new DMatrix(data, nrow, ncol, missing);
 
 * To set weight:
@@ -78,47 +80,31 @@ supported.
 ******************
 Setting Parameters
 ******************
-* In XGBoost4J any ``Iterable<Entry<String, Object>>`` object could be used as parameters.
-
-* To set parameters, for non-multiple value params, you can simply use entrySet of an Map:
-
-  .. code-block:: java
-
-    Map<String, Object> paramMap = new HashMap<>() {
-      {
-        put("eta", 1.0);
-        put("max_depth", 2);
-        put("silent", 1);
-        put("objective", "binary:logistic");
-        put("eval_metric", "logloss");
-      }
-    };
-    Iterable<Entry<String, Object>> params = paramMap.entrySet();
-
-* for the situation that multiple values with same param key, List<Entry<String, Object>> would be a good choice, e.g. :
-
-  .. code-block:: java
-
-    List<Entry<String, Object>> params = new ArrayList<Entry<String, Object>>() {
-      {
-        add(new SimpleEntry<String, Object>("eta", 1.0));
-        add(new SimpleEntry<String, Object>("max_depth", 2.0));
-        add(new SimpleEntry<String, Object>("silent", 1));
-        add(new SimpleEntry<String, Object>("objective", "binary:logistic"));
-      }
-    };
+To set parameters, parameters are specified as a Map:
+
+.. code-block:: java
+
+  Map<String, Object> params = new HashMap<String, Object>() {
+    {
+      put("eta", 1.0);
+      put("max_depth", 2);
+      put("silent", 1);
+      put("objective", "binary:logistic");
+      put("eval_metric", "logloss");
+    }
+  };
 
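The double-brace initializer above creates an anonymous ``HashMap`` subclass each time it runs. An equivalent construction without that idiom, shown here as a sketch for comparison rather than something this commit adds:

.. code-block:: java

   import java.util.HashMap;
   import java.util.Map;

   Map<String, Object> params = new HashMap<String, Object>();
   params.put("eta", 1.0);
   params.put("max_depth", 2);
   params.put("silent", 1);
   params.put("objective", "binary:logistic");
   params.put("eval_metric", "logloss");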
 **************
 Training Model
 **************
 With parameters and data, you are able to train a booster model.
 
-* Import ``Trainer`` and ``Booster``:
+* Import Booster and XGBoost:
 
   .. code-block:: java
 
-    import org.dmlc.xgboost4j.Booster;
-    import org.dmlc.xgboost4j.util.Trainer;
+    import ml.dmlc.xgboost4j.java.Booster;
+    import ml.dmlc.xgboost4j.java.XGBoost;
 
 * Training
@@ -126,13 +112,15 @@ With parameters and data, you are able to train a booster model.
 
     DMatrix trainMat = new DMatrix("train.svm.txt");
     DMatrix validMat = new DMatrix("valid.svm.txt");
-    //specify a watchList to see the performance
-    //any Iterable<Entry<String, DMatrix>> object could be used as watchList
-    List<Entry<String, DMatrix>> watchs = new ArrayList<>();
-    watchs.add(new SimpleEntry<>("train", trainMat));
-    watchs.add(new SimpleEntry<>("test", testMat));
-    int round = 2;
-    Booster booster = Trainer.train(params, trainMat, round, watchs, null, null);
+    // Specify a watch list to see model accuracy on data sets
+    Map<String, DMatrix> watches = new HashMap<String, DMatrix>() {
+      {
+        put("train", trainMat);
+        put("test", testMat);
+      }
+    };
+    int nround = 2;
+    Booster booster = XGBoost.train(trainMat, params, nround, watches, null, null);
 
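One wrinkle worth flagging: both the old and the new snippet declare ``validMat`` but register ``testMat`` in the watch list. A self-contained variant, sketched here under the assumption of a ``test.svm.txt`` file (it is not text from this commit):

.. code-block:: java

   DMatrix trainMat = new DMatrix("train.svm.txt");
   DMatrix testMat = new DMatrix("test.svm.txt");
   Map<String, DMatrix> watches = new HashMap<String, DMatrix>();
   watches.put("train", trainMat);
   watches.put("test", testMat);
   int nround = 2;
   Booster booster = XGBoost.train(trainMat, params, nround, watches, null, null);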
 * Saving model
 
@@ -142,25 +130,20 @@ With parameters and data, you are able to train a booster model.
 
     booster.saveModel("model.bin");
 
-* Dump Model and Feature Map
+* Generating model dump with feature map
 
   .. code-block:: java
 
-    booster.dumpModel("modelInfo.txt", false)
-    //dump with featureMap
-    booster.dumpModel("modelInfo.txt", "featureMap.txt", false)
+    // dump without feature map
+    String[] model_dump = booster.getModelDump(null, false);
+    // dump with feature map
+    String[] model_dump_with_feature_map = booster.getModelDump("featureMap.txt", false);
 
 * Load a model
 
   .. code-block:: java
 
-    Params param = new Params() {
-      {
-        put("silent", 1);
-        put("nthread", 6);
-      }
-    };
-    Booster booster = new Booster(param, "model.bin");
+    Booster booster = XGBoost.loadModel("model.bin");
 
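Since ``getModelDump`` returns a ``String[]`` instead of writing a file, persisting the dump takes one extra step. A sketch of doing so, an editorial illustration assuming Java 8 (``String.join``) and ``java.nio.file``, not part of the commit:

.. code-block:: java

   import java.nio.charset.StandardCharsets;
   import java.nio.file.Files;
   import java.nio.file.Paths;

   String[] modelDump = booster.getModelDump("featureMap.txt", false);
   // one dump entry per tree; join them into a single text file for inspection
   Files.write(Paths.get("modelInfo.txt"),
       String.join("\n", modelDump).getBytes(StandardCharsets.UTF_8));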
 **********
 Prediction
 
@@ -170,8 +153,8 @@ After training and loading a model, you can use it to make prediction for other
 
 .. code-block:: java
 
   DMatrix dtest = new DMatrix("test.svm.txt");
-  //predict
+  // predict
   float[][] predicts = booster.predict(dtest);
-  //predict leaf
-  float[][] leafPredicts = booster.predict(dtest, 0, true);
+  // predict leaf
+  float[][] leafPredicts = booster.predictLeaf(dtest, 0);
 
@@ -211,7 +211,7 @@ Additional parameters for Dart Booster (``booster=dart``)
 
   - range: [0.0, 1.0]
 
-Parameters for Linear Booster (``booster=gbtree``)
+Parameters for Linear Booster (``booster=gblinear``)
 ==================================================
 * ``lambda`` [default=0, alias: ``reg_lambda``]
 
@@ -280,7 +280,7 @@ Specify the learning task and the corresponding learning objective. The objectiv
 
   - ``error``: Binary classification error rate. It is calculated as ``#(wrong cases)/#(all cases)``. For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
   - ``error@t``: a different than 0.5 binary classification threshold value could be specified by providing a numerical value through 't'.
   - ``merror``: Multiclass classification error rate. It is calculated as ``#(wrong cases)/#(all cases)``.
-  - ``mlogloss``: `Multiclass logloss <https://www.kaggle.com/wiki/LogLoss>`_.
+  - ``mlogloss``: `Multiclass logloss <http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html>`_.
   - ``auc``: `Area under the curve <http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_curve>`_
   - ``ndcg``: `Normalized Discounted Cumulative Gain <http://en.wikipedia.org/wiki/NDCG>`_
   - ``map``: `Mean average precision <http://en.wikipedia.org/wiki/Mean_average_precision#Mean_average_precision>`_
 
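Referring back to ``error@t`` above: in the Java binding shown earlier, a custom threshold is passed as part of the metric name. A sketch with an arbitrary threshold of 0.7 (illustrative, not from the docs):

.. code-block:: java

   // count a prediction as positive only above 0.7 when computing the error metric
   params.put("eval_metric", "error@0.7");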
@@ -19,7 +19,7 @@ However, such complicated model requires more data to fit.
 Most of parameters in XGBoost are about bias variance tradeoff. The best model
 should trade the model complexity with its predictive power carefully.
 :doc:`Parameters Documentation </parameter>` will tell you whether each parameter
-ill make the model more conservative or not. This can be used to help you
+will make the model more conservative or not. This can be used to help you
 turn the knob between complicated model and simple model.
 
 *******************
@@ -27,16 +27,16 @@ Control Overfitting
 *******************
 When you observe high training accuracy, but low test accuracy, it is likely that you encountered overfitting problem.
 
-There are in general two ways that you can control overfitting in XGBoost
+There are in general two ways that you can control overfitting in XGBoost:
 
-* The first way is to directly control model complexity
+* The first way is to directly control model complexity.
 
-  - This include ``max_depth``, ``min_child_weight`` and ``gamma``
+  - This includes ``max_depth``, ``min_child_weight`` and ``gamma``.
 
-* The second way is to add randomness to make training robust to noise
+* The second way is to add randomness to make training robust to noise.
 
-  - This include ``subsample`` and ``colsample_bytree``.
-  - You can also reduce stepsize ``eta``. Rremember to increase ``num_round`` when you do so.
+  - This includes ``subsample`` and ``colsample_bytree``.
+  - You can also reduce stepsize ``eta``. Remember to increase ``num_round`` when you do so.
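To tie the two strategies to concrete settings in the Java binding, a sketch whose specific values are illustrative assumptions rather than recommendations from the docs:

.. code-block:: java

   Map<String, Object> params = new HashMap<String, Object>();
   params.put("max_depth", 4);           // directly limit model complexity
   params.put("min_child_weight", 5);
   params.put("gamma", 1.0);
   params.put("subsample", 0.8);         // add randomness to each boosting round
   params.put("colsample_bytree", 0.8);
   params.put("eta", 0.1);               // smaller stepsize; raise num_round to compensate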
 
 *************************
 Handle Imbalanced Dataset
@@ -68,8 +68,12 @@ be found in the [examples package](https://github.com/dmlc/xgboost/tree/master/j
 
 **NOTE on LIBSVM Format**:
 
-* Use *1-based* ascending indexes for the LIBSVM format in distributed training mode
-
-* Spark does the internal conversion, and does not accept formats that are 0-based
-
-* Whereas, use *0-based* indexes format when predicting in normal mode - for instance, while using the saved model in the Python package
+There is an inconsistent issue between XGBoost4J-Spark and other language bindings of XGBoost.
+
+When users use Spark to load trainingset/testset in LibSVM format with the following code snippet:
+
+```scala
+spark.read.format("libsvm").load("trainingset_libsvm")
+```
+
+Spark assumes that the dataset is 1-based indexed. However, when you do prediction with other bindings of XGBoost (e.g. Python API of XGBoost), XGBoost assumes that the dataset is 0-based indexed. It creates a pitfall for the users who train model with Spark but predict with the dataset in the same format in other bindings of XGBoost.