Backport doc fixes that are compatible with 0.72 release
* Clarify behavior of LIBSVM in XGBoost4J-Spark (#3524)
* Fix typo in faq.rst (#3521)
* Fix typo in parameter.rst, gblinear section (#3518)
* Clarify supported OSes for XGBoost4J published JARs (#3547)
* Update broken links (#3565)
* Grammar fixes and typos (#3568)
* Bring XGBoost4J Intro up-to-date (#3574)
parent e19dded9a3
commit 4334b9cc91
@@ -7,7 +7,7 @@ This document contains frequently asked questions about XGBoost.
 **********************
 How to tune parameters
 **********************
-See :doc:`Parameter Tunning Guide </tutorials/param_tuning>`.
+See :doc:`Parameter Tuning Guide </tutorials/param_tuning>`.
 
 ************************
 Description on the model
@@ -56,6 +56,13 @@ For sbt, please add the repository and dependency in build.sbt as following:
 
   "ml.dmlc" % "xgboost4j" % "latest_source_version_num"
 
+If you want to use XGBoost4J-Spark, replace ``xgboost4j`` with ``xgboost4j-spark``.
+
+.. note:: Spark 2.0 Required
+
+   After integrating with the Dataframe/Dataset APIs of Spark 2.0, XGBoost4J-Spark can only be compiled against Spark 2.x. You can build XGBoost4J-Spark as a component of XGBoost4J by running ``mvn package``, and you can specify the Spark version with ``mvn -Dspark.version=2.0.0 package``. (To continue working with Spark 1.x, update pom.xml by modifying properties such as ``spark.version``, ``scala.version``, and ``scala.binary.version``. You also need to change the implementation by replacing ``SparkSession`` with ``SQLContext`` and the type of API parameters from ``Dataset[_]`` to ``DataFrame``.)
+
+
 Installation from maven repo
 ============================
 
@@ -76,9 +83,11 @@ Access release version
 
   "ml.dmlc" % "xgboost4j" % "latest_version_num"
 
+This will check out the latest stable version from Maven Central.
+
 For the latest release version number, please check `here <https://github.com/dmlc/xgboost/releases>`_.
 
-if you want to use XGBoost4J-Spark, you just need to replace ``xgboost4j`` with ``xgboost4j-spark``.
+If you want to use XGBoost4J-Spark, replace ``xgboost4j`` with ``xgboost4j-spark``.
 
 Access SNAPSHOT version
 -----------------------
@@ -117,9 +126,9 @@ Then add dependency as following:
 
 For the latest release version number, please check `here <https://github.com/CodingCat/xgboost/tree/maven-repo/ml/dmlc/xgboost4j>`_.
 
-if you want to use XGBoost4J-Spark, you just need to replace ``xgboost4j`` with ``xgboost4j-spark``.
+.. note:: Windows not supported by published JARs
 
-After integrating with Dataframe/Dataset APIs of Spark 2.0, XGBoost4J-Spark only supports compile with Spark 2.x. You can build XGBoost4J-Spark as a component of XGBoost4J by running ``mvn package``, and you can specify the version of spark with ``mvn -Dspark.version=2.0.0 package``. (To continue working with Spark 1.x, the users are supposed to update pom.xml by modifying the properties like ``spark.version``, ``scala.version``, and ``scala.binary.version``. Users also need to change the implementation by replacing ``SparkSession`` with ``SQLContext`` and the type of API parameters from ``Dataset[_]`` to ``Dataframe``)
+   The published JARs from Maven Central and GitHub currently only support Linux and MacOS. Windows users should consider building XGBoost4J / XGBoost4J-Spark from source. Alternatively, check out pre-built JARs from `criteo-forks/xgboost-jars <https://github.com/criteo-forks/xgboost-jars>`_.
 
 Enabling OpenMP for Mac OS
 --------------------------
@@ -136,8 +145,9 @@ Contents
 ********
 
 .. toctree::
+  :maxdepth: 2
 
-  Java Overview Tutorial <java_intro>
+  java_intro
   Code Examples <https://github.com/dmlc/xgboost/tree/master/jvm-packages/xgboost4j-example>
   XGBoost4J Java API <http://dmlc.ml/docs/javadocs/index.html>
   XGBoost4J Scala API <http://dmlc.ml/docs/scaladocs/xgboost4j/index.html>
@@ -1,28 +1,28 @@
-##################
-XGBoost4J Java API
-##################
+##############################
+Getting Started with XGBoost4J
+##############################
 This tutorial introduces Java API for XGBoost.
 
 **************
 Data Interface
 **************
-Like the XGBoost python module, XGBoost4J uses ``DMatrix`` to handle data,
-libsvm txt format file, sparse matrix in CSR/CSC format, and dense matrix is
+Like the XGBoost python module, XGBoost4J uses DMatrix to handle data.
+LIBSVM txt format file, sparse matrix in CSR/CSC format, and dense matrix are
 supported.
 
-* The first step is to import ``DMatrix``:
+* The first step is to import DMatrix:
 
   .. code-block:: java
 
-    import org.dmlc.xgboost4j.DMatrix;
+    import ml.dmlc.xgboost4j.java.DMatrix;
 
-* Use ``DMatrix`` constructor to load data from a libsvm text format file:
+* Use DMatrix constructor to load data from a libsvm text format file:
 
   .. code-block:: java
 
     DMatrix dmat = new DMatrix("train.svm.txt");
 
-* Pass arrays to ``DMatrix`` constructor to load from sparse matrix.
+* Pass arrays to DMatrix constructor to load from sparse matrix.
 
   Suppose we have a sparse matrix
 
@@ -39,7 +39,8 @@ supported.
     long[] rowHeaders = new long[] {0,2,4,7};
     float[] data = new float[] {1f,2f,4f,3f,3f,1f,2f};
     int[] colIndex = new int[] {0,2,0,3,0,1,2};
-    DMatrix dmat = new DMatrix(rowHeaders, colIndex, data, DMatrix.SparseType.CSR);
+    int numColumn = 4;
+    DMatrix dmat = new DMatrix(rowHeaders, colIndex, data, DMatrix.SparseType.CSR, numColumn);
 
   ... or in `Compressed Sparse Column (CSC) <https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_column_(CSC_or_CCS)>`_ format:
 
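For reference, the CSR triplet above (and the CSC triplet in the next hunk) encodes a 3x4 matrix. A minimal sketch, not part of the original tutorial, that reuses the variables from the snippet above and decodes the CSR arrays back to dense form to make the layout explicit:

.. code-block:: java

  // rowHeaders[r]..rowHeaders[r+1] delimits row r's entries in data/colIndex.
  int numRow = rowHeaders.length - 1;  // 3 rows
  float[][] dense = new float[numRow][numColumn];
  for (int r = 0; r < numRow; r++) {
    for (long i = rowHeaders[r]; i < rowHeaders[r + 1]; i++) {
      dense[r][colIndex[(int) i]] = data[(int) i];
    }
  }
  // dense is now: {1,0,2,0}, {4,0,0,3}, {3,1,2,0}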
@@ -48,7 +49,8 @@ supported.
     long[] colHeaders = new long[] {0,3,4,6,7};
     float[] data = new float[] {1f,4f,3f,1f,2f,2f,3f};
     int[] rowIndex = new int[] {0,1,2,2,0,2,1};
-    DMatrix dmat = new DMatrix(colHeaders, rowIndex, data, DMatrix.SparseType.CSC);
+    int numRow = 3;
+    DMatrix dmat = new DMatrix(colHeaders, rowIndex, data, DMatrix.SparseType.CSC, numRow);
 
 * You may also load your data from a dense matrix. Let's assume we have a matrix of form
 
@@ -66,7 +68,7 @@ supported.
     int nrow = 3;
     int ncol = 2;
     float missing = 0.0f;
-    DMatrix dmat = new Matrix(data, nrow, ncol, missing);
+    DMatrix dmat = new DMatrix(data, nrow, ncol, missing);
 
 * To set weight:
 
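The hunk does not show the ``data`` array for this 3x2 example, so here is a hypothetical sketch (the concrete values are illustrative, not taken from the original doc) of how the dense constructor expects its input:

.. code-block:: java

  // Row-major layout: row 0 = {1, 2}, row 1 = {3, 4}, row 2 = {5, 6}.
  float[] data = new float[] {1f, 2f, 3f, 4f, 5f, 6f};
  // Any cell equal to `missing` (0.0f here) is treated as a missing value.
  DMatrix dmat = new DMatrix(data, nrow, ncol, missing);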
@ -78,47 +80,31 @@ supported.
|
|||||||
******************
|
******************
|
||||||
Setting Parameters
|
Setting Parameters
|
||||||
******************
|
******************
|
||||||
* In XGBoost4J any ``Iterable<Entry<String, Object>>`` object could be used as parameters.
|
To set parameters, parameters are specified as a Map:
|
||||||
|
|
||||||
* To set parameters, for non-multiple value params, you can simply use entrySet of an Map:
|
.. code-block:: java
|
||||||
|
|
||||||
.. code-block:: java
|
Map<String, Object> params = new HashMap<String, Object>() {
|
||||||
|
{
|
||||||
Map<String, Object> paramMap = new HashMap<>() {
|
put("eta", 1.0);
|
||||||
{
|
put("max_depth", 2);
|
||||||
put("eta", 1.0);
|
put("silent", 1);
|
||||||
put("max_depth", 2);
|
put("objective", "binary:logistic");
|
||||||
put("silent", 1);
|
put("eval_metric", "logloss");
|
||||||
put("objective", "binary:logistic");
|
}
|
||||||
put("eval_metric", "logloss");
|
};
|
||||||
}
|
|
||||||
};
|
|
||||||
Iterable<Entry<String, Object>> params = paramMap.entrySet();
|
|
||||||
|
|
||||||
* for the situation that multiple values with same param key, List<Entry<String, Object>> would be a good choice, e.g. :
|
|
||||||
|
|
||||||
.. code-block:: java
|
|
||||||
|
|
||||||
List<Entry<String, Object>> params = new ArrayList<Entry<String, Object>>() {
|
|
||||||
{
|
|
||||||
add(new SimpleEntry<String, Object>("eta", 1.0));
|
|
||||||
add(new SimpleEntry<String, Object>("max_depth", 2.0));
|
|
||||||
add(new SimpleEntry<String, Object>("silent", 1));
|
|
||||||
add(new SimpleEntry<String, Object>("objective", "binary:logistic"));
|
|
||||||
}
|
|
||||||
};
|
|
||||||
|
|
||||||
**************
|
**************
|
||||||
Training Model
|
Training Model
|
||||||
**************
|
**************
|
||||||
With parameters and data, you are able to train a booster model.
|
With parameters and data, you are able to train a booster model.
|
||||||
|
|
||||||
* Import ``Trainer`` and ``Booster``:
|
* Import Booster and XGBoost:
|
||||||
|
|
||||||
.. code-block:: java
|
.. code-block:: java
|
||||||
|
|
||||||
import org.dmlc.xgboost4j.Booster;
|
import ml.dmlc.xgboost4j.java.Booster;
|
||||||
import org.dmlc.xgboost4j.util.Trainer;
|
import ml.dmlc.xgboost4j.java.XGBoost;
|
||||||
|
|
||||||
* Training
|
* Training
|
||||||
|
|
||||||
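The double-brace initializer above works but creates an anonymous ``HashMap`` subclass. A plain-``HashMap`` sketch that behaves identically, for anyone who prefers to avoid that idiom, is:

.. code-block:: java

  import java.util.HashMap;
  import java.util.Map;

  // Same parameters, built with ordinary puts instead of double-brace initialization.
  Map<String, Object> params = new HashMap<>();
  params.put("eta", 1.0);
  params.put("max_depth", 2);
  params.put("silent", 1);
  params.put("objective", "binary:logistic");
  params.put("eval_metric", "logloss");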
@@ -126,13 +112,15 @@ With parameters and data, you are able to train a booster model.
 
     DMatrix trainMat = new DMatrix("train.svm.txt");
     DMatrix validMat = new DMatrix("valid.svm.txt");
-    //specify a watchList to see the performance
-    //any Iterable<Entry<String, DMatrix>> object could be used as watchList
-    List<Entry<String, DMatrix>> watchs = new ArrayList<>();
-    watchs.add(new SimpleEntry<>("train", trainMat));
-    watchs.add(new SimpleEntry<>("test", testMat));
-    int round = 2;
-    Booster booster = Trainer.train(params, trainMat, round, watchs, null, null);
+    // Specify a watch list to see model accuracy on data sets
+    Map<String, DMatrix> watches = new HashMap<String, DMatrix>() {
+      {
+        put("train", trainMat);
+        put("valid", validMat);
+      }
+    };
+    int nround = 2;
+    Booster booster = XGBoost.train(trainMat, params, nround, watches, null, null);
 
 * Saving model
 
@@ -142,25 +130,20 @@ With parameters and data, you are able to train a booster model.
 
     booster.saveModel("model.bin");
 
-* Dump Model and Feature Map
+* Generating model dump with feature map
 
   .. code-block:: java
 
-    booster.dumpModel("modelInfo.txt", false)
-    //dump with featureMap
-    booster.dumpModel("modelInfo.txt", "featureMap.txt", false)
+    // dump without feature map
+    String[] model_dump = booster.getModelDump(null, false);
+    // dump with feature map
+    String[] model_dump_with_feature_map = booster.getModelDump("featureMap.txt", false);
 
 * Load a model
 
   .. code-block:: java
 
-    Params param = new Params() {
-      {
-        put("silent", 1);
-        put("nthread", 6);
-      }
-    };
-    Booster booster = new Booster(param, "model.bin");
+    Booster booster = XGBoost.loadModel("model.bin");
 
 **********
 Prediction
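``getModelDump`` returns one text dump per boosted tree. A small usage sketch, assuming the ``model_dump`` array from the snippet above, for inspecting the learned structure:

.. code-block:: java

  // Print each tree's structure; the boolean argument to getModelDump
  // (with_stats) controls whether split statistics are included in the dump.
  for (String tree : model_dump) {
    System.out.println(tree);
  }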
@@ -170,8 +153,8 @@ After training and loading a model, you can use it to make prediction for other
 .. code-block:: java
 
   DMatrix dtest = new DMatrix("test.svm.txt");
-  //predict
+  // predict
   float[][] predicts = booster.predict(dtest);
-  //predict leaf
-  float[][] leafPredicts = booster.predict(dtest, 0, true);
+  // predict leaf
+  float[][] leafPredicts = booster.predictLeaf(dtest, 0);
 
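``predict`` returns one row per instance in ``dtest``. A minimal sketch, reusing ``predicts`` from the snippet above:

.. code-block:: java

  // With objective binary:logistic, each row holds a single predicted
  // probability for the corresponding instance.
  for (float[] row : predicts) {
    System.out.println(row[0]);
  }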
@@ -211,7 +211,7 @@ Additional parameters for Dart Booster (``booster=dart``)
 
 - range: [0.0, 1.0]
 
-Parameters for Linear Booster (``booster=gbtree``)
+Parameters for Linear Booster (``booster=gblinear``)
 ==================================================
 * ``lambda`` [default=0, alias: ``reg_lambda``]
 
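A sketch of selecting the linear booster from the Java API used elsewhere in this commit, assuming a ``params`` map as in the tutorial; the values are illustrative, not recommendations:

.. code-block:: java

  params.put("booster", "gblinear");
  params.put("lambda", 0.1);  // L2 regularization on weights (alias: reg_lambda)
  params.put("alpha", 0.0);   // L1 regularization on weights (alias: reg_alpha)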
@@ -280,7 +280,7 @@ Specify the learning task and the corresponding learning objective. The objectiv
 - ``error``: Binary classification error rate. It is calculated as ``#(wrong cases)/#(all cases)``. For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
 - ``error@t``: a different than 0.5 binary classification threshold value could be specified by providing a numerical value through 't'.
 - ``merror``: Multiclass classification error rate. It is calculated as ``#(wrong cases)/#(all cases)``.
-- ``mlogloss``: `Multiclass logloss <https://www.kaggle.com/wiki/LogLoss>`_.
+- ``mlogloss``: `Multiclass logloss <http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html>`_.
 - ``auc``: `Area under the curve <http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_curve>`_
 - ``ndcg``: `Normalized Discounted Cumulative Gain <http://en.wikipedia.org/wiki/NDCG>`_
 - ``map``: `Mean average precision <http://en.wikipedia.org/wiki/Mean_average_precision#Mean_average_precision>`_
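The ``error@t`` form takes the threshold inline. A sketch of setting it from the Java API of the tutorial above (the 0.7 value is illustrative):

.. code-block:: java

  // Evaluate binary classification error at a 0.7 decision threshold instead of 0.5.
  params.put("eval_metric", "error@0.7");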
@@ -19,7 +19,7 @@ However, such complicated model requires more data to fit.
 Most of parameters in XGBoost are about bias variance tradeoff. The best model
 should trade the model complexity with its predictive power carefully.
 :doc:`Parameters Documentation </parameter>` will tell you whether each parameter
-ill make the model more conservative or not. This can be used to help you
+will make the model more conservative or not. This can be used to help you
 turn the knob between complicated model and simple model.
 
 *******************
@ -27,16 +27,16 @@ Control Overfitting
|
|||||||
*******************
|
*******************
|
||||||
When you observe high training accuracy, but low test accuracy, it is likely that you encountered overfitting problem.
|
When you observe high training accuracy, but low test accuracy, it is likely that you encountered overfitting problem.
|
||||||
|
|
||||||
There are in general two ways that you can control overfitting in XGBoost
|
There are in general two ways that you can control overfitting in XGBoost:
|
||||||
|
|
||||||
* The first way is to directly control model complexity
|
* The first way is to directly control model complexity.
|
||||||
|
|
||||||
- This include ``max_depth``, ``min_child_weight`` and ``gamma``
|
- This includes ``max_depth``, ``min_child_weight`` and ``gamma``.
|
||||||
|
|
||||||
* The second way is to add randomness to make training robust to noise
|
* The second way is to add randomness to make training robust to noise.
|
||||||
|
|
||||||
- This include ``subsample`` and ``colsample_bytree``.
|
- This includes ``subsample`` and ``colsample_bytree``.
|
||||||
- You can also reduce stepsize ``eta``. Rremember to increase ``num_round`` when you do so.
|
- You can also reduce stepsize ``eta``. Remember to increase ``num_round`` when you do so.
|
||||||
|
|
||||||
*************************
|
*************************
|
||||||
Handle Imbalanced Dataset
|
Handle Imbalanced Dataset
|
||||||
|
|||||||
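A sketch of both levers expressed as parameters, in the Java register used elsewhere in this commit and assuming a ``params`` map as in the tutorial; the values are illustrative only:

.. code-block:: java

  // Lever 1: directly limit model complexity.
  params.put("max_depth", 4);
  params.put("min_child_weight", 5);
  params.put("gamma", 1.0);
  // Lever 2: add randomness to make training robust to noise.
  params.put("subsample", 0.8);
  params.put("colsample_bytree", 0.8);
  // Shrinking eta reduces each boosting step; raise the round count to compensate.
  params.put("eta", 0.1);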
@@ -68,8 +68,12 @@ be found in the [examples package](https://github.com/dmlc/xgboost/tree/master/j
 
 **NOTE on LIBSVM Format**:
 
-* Use *1-based* ascending indexes for the LIBSVM format in distributed training mode
+There is an inconsistency between XGBoost4J-Spark and the other language bindings of XGBoost.
 
-* Spark does the internal conversion, and does not accept formats that are 0-based
+When users use Spark to load a training set or test set in LIBSVM format with code like the following:
 
-* Whereas, use *0-based* indexes format when predicting in normal mode - for instance, while using the saved model in the Python package
+```scala
+spark.read.format("libsvm").load("trainingset_libsvm")
+```
+
+Spark assumes that the dataset is 1-based indexed. However, when you do prediction with other bindings of XGBoost (e.g. the Python API of XGBoost), XGBoost assumes that the dataset is 0-based indexed. This creates a pitfall for users who train a model with Spark but predict with data in the same format using other bindings of XGBoost.
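To make the off-by-one concrete, here is a hypothetical illustration (not from the original doc) of how a single LIBSVM line lands in feature columns under each convention:

```java
// LIBSVM line: "1 2:1.5 4:0.3" (label 1, features at indexes 2 and 4).
// 1-based reading (Spark's assumption): indexes shift down by one internally.
float[] oneBasedRow  = new float[] {0f, 1.5f, 0f, 0.3f};      // values land in columns 1 and 3
// 0-based reading (e.g. the XGBoost Python binding): indexes are used as-is.
float[] zeroBasedRow = new float[] {0f, 0f, 1.5f, 0f, 0.3f};  // values land in columns 2 and 4
```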