Backport doc fixes that are compatible with 0.72 release

* Clarify behavior of LIBSVM in XGBoost4J-Spark (#3524)
* Fix typo in faq.rst (#3521)
* Fix typo in parameter.rst, gblinear section (#3518)
* Clarify supported OSes for XGBoost4J published JARs (#3547)
* Update broken links (#3565)
* Grammar fixes and typos (#3568)
* Bring XGBoost4J Intro up-to-date (#3574)
Nan Zhu authored 2018-07-28 17:34:39 -07:00, committed by Philip Cho
parent e19dded9a3
commit 4334b9cc91
6 changed files with 77 additions and 80 deletions


@@ -7,7 +7,7 @@ This document contains frequently asked questions about XGBoost.
**********************
How to tune parameters
**********************
See :doc:`Parameter Tunning Guide </tutorials/param_tuning>`.
See :doc:`Parameter Tuning Guide </tutorials/param_tuning>`.
************************
Description on the model


@@ -56,6 +56,13 @@ For sbt, please add the repository and dependency in build.sbt as following:
"ml.dmlc" % "xgboost4j" % "latest_source_version_num"
If you want to use XGBoost4J-Spark, replace ``xgboost4j`` with ``xgboost4j-spark``.
.. note:: Spark 2.0 Required
After integrating with the Dataframe/Dataset APIs of Spark 2.0, XGBoost4J-Spark only supports compilation against Spark 2.x. You can build XGBoost4J-Spark as a component of XGBoost4J by running ``mvn package``, and you can specify the Spark version with ``mvn -Dspark.version=2.0.0 package``. (To continue working with Spark 1.x, update pom.xml by modifying properties such as ``spark.version``, ``scala.version``, and ``scala.binary.version``. You will also need to change the implementation by replacing ``SparkSession`` with ``SQLContext`` and changing the type of API parameters from ``Dataset[_]`` to ``Dataframe``.)
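As a hedged illustration of the API swap described in the note (the class names are standard Spark, and the file name is made up for this sketch):

.. code-block:: java

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Spark 2.x entry point that XGBoost4J-Spark compiles against.
SparkSession spark = SparkSession.builder().appName("xgboost4j-spark-demo").getOrCreate();
Dataset<Row> df = spark.read().format("libsvm").load("train_libsvm");
// On Spark 1.x, you would construct a SQLContext instead, and the
// XGBoost4J-Spark APIs would take DataFrame rather than Dataset[_].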
Installation from maven repo
============================
@@ -76,9 +83,11 @@ Access release version
"ml.dmlc" % "xgboost4j" % "latest_version_num"
This will checkout the latest stable version from the Maven Central.
For the latest release version number, please check `here <https://github.com/dmlc/xgboost/releases>`_.
if you want to use XGBoost4J-Spark, you just need to replace ``xgboost4j`` with ``xgboost4j-spark``.
If you want to use XGBoost4J-Spark, replace ``xgboost4j`` with ``xgboost4j-spark``.
Access SNAPSHOT version
-----------------------
@@ -117,9 +126,9 @@ Then add dependency as following:
For the latest release version number, please check `here <https://github.com/CodingCat/xgboost/tree/maven-repo/ml/dmlc/xgboost4j>`_.
If you want to use XGBoost4J-Spark, you just need to replace ``xgboost4j`` with ``xgboost4j-spark``.
.. note:: Windows not supported by published JARs
After integrating with Dataframe/Dataset APIs of Spark 2.0, XGBoost4J-Spark only supports compile with Spark 2.x. You can build XGBoost4J-Spark as a component of XGBoost4J by running ``mvn package``, and you can specify the version of spark with ``mvn -Dspark.version=2.0.0 package``. (To continue working with Spark 1.x, the users are supposed to update pom.xml by modifying the properties like ``spark.version``, ``scala.version``, and ``scala.binary.version``. Users also need to change the implementation by replacing ``SparkSession`` with ``SQLContext`` and the type of API parameters from ``Dataset[_]`` to ``Dataframe``)
The published JARs from Maven Central and GitHub currently only support Linux and MacOS. Windows users should consider building XGBoost4J / XGBoost4J-Spark from source. Alternatively, check out pre-built JARs from `criteo-forks/xgboost-jars <https://github.com/criteo-forks/xgboost-jars>`_.
Enabling OpenMP for Mac OS
--------------------------
@@ -136,8 +145,9 @@ Contents
********
.. toctree::
:maxdepth: 2
Java Overview Tutorial <java_intro>
java_intro
Code Examples <https://github.com/dmlc/xgboost/tree/master/jvm-packages/xgboost4j-example>
XGBoost4J Java API <http://dmlc.ml/docs/javadocs/index.html>
XGBoost4J Scala API <http://dmlc.ml/docs/scaladocs/xgboost4j/index.html>


@@ -1,28 +1,28 @@
##################
XGBoost4J Java API
##################
##############################
Getting Started with XGBoost4J
##############################
This tutorial introduces the Java API for XGBoost.
**************
Data Interface
**************
Like the XGBoost python module, XGBoost4J uses ``DMatrix`` to handle data,
libsvm txt format file, sparse matrix in CSR/CSC format, and dense matrix is
Like the XGBoost Python module, XGBoost4J uses DMatrix to handle data.
LIBSVM txt format file, sparse matrix in CSR/CSC format, and dense matrix are
supported.
* The first step is to import ``DMatrix``:
* The first step is to import DMatrix:
.. code-block:: java
import org.dmlc.xgboost4j.DMatrix;
import ml.dmlc.xgboost4j.java.DMatrix;
* Use ``DMatrix`` constructor to load data from a libsvm text format file:
* Use DMatrix constructor to load data from a libsvm text format file:
.. code-block:: java
DMatrix dmat = new DMatrix("train.svm.txt");
* Pass arrays to ``DMatrix`` constructor to load from sparse matrix.
* Pass arrays to DMatrix constructor to load from sparse matrix.
Suppose we have a sparse matrix
@@ -39,7 +39,8 @@ supported.
long[] rowHeaders = new long[] {0,2,4,7};
float[] data = new float[] {1f,2f,4f,3f,3f,1f,2f};
int[] colIndex = new int[] {0,2,0,3,0,1,2};
DMatrix dmat = new DMatrix(rowHeaders, colIndex, data, DMatrix.SparseType.CSR);
int numColumn = 4;
DMatrix dmat = new DMatrix(rowHeaders, colIndex, data, DMatrix.SparseType.CSR, numColumn);
... or in `Compressed Sparse Column (CSC) <https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_column_(CSC_or_CCS)>`_ format:
@@ -48,7 +49,8 @@ supported.
long[] colHeaders = new long[] {0,3,4,6,7};
float[] data = new float[] {1f,4f,3f,1f,2f,2f,3f};
int[] rowIndex = new int[] {0,1,2,2,0,2,1};
DMatrix dmat = new DMatrix(colHeaders, rowIndex, data, DMatrix.SparseType.CSC);
int numRow = 3;
DMatrix dmat = new DMatrix(colHeaders, rowIndex, data, DMatrix.SparseType.CSC, numRow);
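Both snippets above encode the same underlying matrix. As a sketch (the dense values below are reconstructed from the CSR arrays, not stated in this hunk), decoding them back into dense form:

.. code-block:: java

import java.util.Arrays;

// Decode the CSR arrays above into the dense 3x4 matrix they represent.
long[] rowHeaders = new long[] {0,2,4,7};
float[] data = new float[] {1f,2f,4f,3f,3f,1f,2f};
int[] colIndex = new int[] {0,2,0,3,0,1,2};
int numColumn = 4;
float[][] dense = new float[rowHeaders.length - 1][numColumn];
for (int row = 0; row < dense.length; row++) {
  // Entries of row i sit at positions rowHeaders[i] .. rowHeaders[i+1]-1.
  for (long k = rowHeaders[row]; k < rowHeaders[row + 1]; k++) {
    dense[row][colIndex[(int) k]] = data[(int) k];
  }
}
for (float[] r : dense) {
  System.out.println(Arrays.toString(r));  // [1,0,2,0], [4,0,0,3], [3,1,2,0]
}

The CSC arrays describe the same matrix column by column: ``colHeaders[j] .. colHeaders[j+1]-1`` delimit column ``j``'s entries in ``data``/``rowIndex``.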
* You may also load your data from a dense matrix. Let's assume we have a matrix of form
@@ -66,7 +68,7 @@ supported.
int nrow = 3;
int ncol = 2;
float missing = 0.0f;
DMatrix dmat = new Matrix(data, nrow, ncol, missing);
DMatrix dmat = new DMatrix(data, nrow, ncol, missing);
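The ``data`` array itself sits outside this hunk; assuming the 3x2 matrix is stored row-major, it would look like the following (values are illustrative):

.. code-block:: java

// Row-major layout of a 3x2 matrix with rows {1,2}, {3,4}, {5,6}:
float[] data = new float[] {1f, 2f, 3f, 4f, 5f, 6f};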
* To set weight:
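The weight snippet itself falls outside this hunk; a minimal sketch, assuming the ``DMatrix.setWeight`` API (one float per instance):

.. code-block:: java

float[] weights = new float[] {1f, 2f, 1f};  // one weight per row of dmat (illustrative values)
dmat.setWeight(weights);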
@@ -78,47 +80,31 @@ supported.
******************
Setting Parameters
******************
* In XGBoost4J any ``Iterable<Entry<String, Object>>`` object could be used as parameters.
To set parameters, specify them as a Map:
* To set parameters, for non-multiple value params, you can simply use entrySet of an Map:
.. code-block:: java
.. code-block:: java
Map<String, Object> paramMap = new HashMap<>() {
{
put("eta", 1.0);
put("max_depth", 2);
put("silent", 1);
put("objective", "binary:logistic");
put("eval_metric", "logloss");
}
};
Iterable<Entry<String, Object>> params = paramMap.entrySet();
* for the situation that multiple values with same param key, List<Entry<String, Object>> would be a good choice, e.g. :
.. code-block:: java
List<Entry<String, Object>> params = new ArrayList<Entry<String, Object>>() {
{
add(new SimpleEntry<String, Object>("eta", 1.0));
add(new SimpleEntry<String, Object>("max_depth", 2.0));
add(new SimpleEntry<String, Object>("silent", 1));
add(new SimpleEntry<String, Object>("objective", "binary:logistic"));
}
};
Map<String, Object> params = new HashMap<String, Object>() {
{
put("eta", 1.0);
put("max_depth", 2);
put("silent", 1);
put("objective", "binary:logistic");
put("eval_metric", "logloss");
}
};
**************
Training Model
**************
With parameters and data, you are able to train a booster model.
* Import ``Trainer`` and ``Booster``:
* Import Booster and XGBoost:
.. code-block:: java
import org.dmlc.xgboost4j.Booster;
import org.dmlc.xgboost4j.util.Trainer;
import ml.dmlc.xgboost4j.java.Booster;
import ml.dmlc.xgboost4j.java.XGBoost;
* Training
@@ -126,13 +112,15 @@ With parameters and data, you are able to train a booster model.
DMatrix trainMat = new DMatrix("train.svm.txt");
DMatrix validMat = new DMatrix("valid.svm.txt");
//specify a watchList to see the performance
//any Iterable<Entry<String, DMatrix>> object could be used as watchList
List<Entry<String, DMatrix>> watchs = new ArrayList<>();
watchs.add(new SimpleEntry<>("train", trainMat));
watchs.add(new SimpleEntry<>("test", testMat));
int round = 2;
Booster booster = Trainer.train(params, trainMat, round, watchs, null, null);
// Specify a watch list to see model accuracy on data sets
Map<String, DMatrix> watches = new HashMap<String, DMatrix>() {
{
put("train", trainMat);
put("test", testMat);
}
};
int nround = 2;
Booster booster = XGBoost.train(trainMat, params, nround, watches, null, null);
* Saving model
@@ -142,25 +130,20 @@ With parameters and data, you are able to train a booster model.
booster.saveModel("model.bin");
* Dump Model and Feature Map
* Generating model dump with feature map
.. code-block:: java
booster.dumpModel("modelInfo.txt", false)
//dump with featureMap
booster.dumpModel("modelInfo.txt", "featureMap.txt", false)
// dump without feature map
String[] model_dump = booster.getModelDump(null, false);
// dump with feature map
String[] model_dump_with_feature_map = booster.getModelDump("featureMap.txt", false);
* Load a model
.. code-block:: java
Params param = new Params() {
{
put("silent", 1);
put("nthread", 6);
}
};
Booster booster = new Booster(param, "model.bin");
Booster booster = XGBoost.loadModel("model.bin");
**********
Prediction
@@ -170,8 +153,8 @@ After training and loading a model, you can use it to make prediction for other
.. code-block:: java
DMatrix dtest = new DMatrix("test.svm.txt");
//predict
// predict
float[][] predicts = booster.predict(dtest);
//predict leaf
float[][] leafPredicts = booster.predict(dtest, 0, true);
// predict leaf
float[][] leafPredicts = booster.predictLeaf(dtest, 0);


@@ -211,7 +211,7 @@ Additional parameters for Dart Booster (``booster=dart``)
- range: [0.0, 1.0]
Parameters for Linear Booster (``booster=gbtree``)
Parameters for Linear Booster (``booster=gblinear``)
==================================================
* ``lambda`` [default=0, alias: ``reg_lambda``]
@@ -280,7 +280,7 @@ Specify the learning task and the corresponding learning objective. The objectiv
- ``error``: Binary classification error rate. It is calculated as ``#(wrong cases)/#(all cases)``. For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
- ``error@t``: a binary classification threshold different from 0.5 can be specified by providing a numerical value through 't'.
- ``merror``: Multiclass classification error rate. It is calculated as ``#(wrong cases)/#(all cases)``.
- ``mlogloss``: `Multiclass logloss <https://www.kaggle.com/wiki/LogLoss>`_.
- ``mlogloss``: `Multiclass logloss <http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html>`_.
- ``auc``: `Area under the curve <http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_curve>`_
- ``ndcg``: `Normalized Discounted Cumulative Gain <http://en.wikipedia.org/wiki/NDCG>`_
- ``map``: `Mean average precision <http://en.wikipedia.org/wiki/Mean_average_precision#Mean_average_precision>`_
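As a hedged example of selecting one of these metrics from XGBoost4J (the values are illustrative), including a non-default threshold via ``@t``:

.. code-block:: java

import java.util.HashMap;
import java.util.Map;

Map<String, Object> params = new HashMap<String, Object>() {
  {
    put("objective", "binary:logistic");
    // Report classification error, counting a prediction as positive
    // only when it exceeds 0.7 rather than the default 0.5.
    put("eval_metric", "error@0.7");
  }
};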


@@ -19,7 +19,7 @@ However, such complicated model requires more data to fit.
Most of the parameters in XGBoost are about the bias-variance tradeoff. The best model
should carefully trade model complexity against predictive power.
:doc:`Parameters Documentation </parameter>` will tell you whether each parameter
ill make the model more conservative or not. This can be used to help you
will make the model more conservative or not. This can be used to help you
turn the knob between complicated model and simple model.
*******************
@@ -27,16 +27,16 @@ Control Overfitting
*******************
When you observe high training accuracy but low test accuracy, it is likely that you have encountered an overfitting problem.
There are in general two ways that you can control overfitting in XGBoost
There are in general two ways that you can control overfitting in XGBoost:
* The first way is to directly control model complexity
* The first way is to directly control model complexity.
- This include ``max_depth``, ``min_child_weight`` and ``gamma``
- This includes ``max_depth``, ``min_child_weight`` and ``gamma``.
* The second way is to add randomness to make training robust to noise
* The second way is to add randomness to make training robust to noise.
- This include ``subsample`` and ``colsample_bytree``.
- You can also reduce stepsize ``eta``. Rremember to increase ``num_round`` when you do so.
- This includes ``subsample`` and ``colsample_bytree``.
- You can also reduce stepsize ``eta``. Remember to increase ``num_round`` when you do so.
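As a sketch only (the numbers are illustrative, not tuned recommendations), the two approaches map onto XGBoost4J parameters like this:

.. code-block:: java

import java.util.HashMap;
import java.util.Map;

Map<String, Object> params = new HashMap<String, Object>() {
  {
    // 1. Directly control model complexity.
    put("max_depth", 4);
    put("min_child_weight", 5);
    put("gamma", 1.0);
    // 2. Add randomness to make training robust to noise.
    put("subsample", 0.8);
    put("colsample_bytree", 0.8);
    put("eta", 0.1);  // smaller step size; raise num_round to compensate
  }
};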
*************************
Handle Imbalanced Dataset


@@ -68,8 +68,12 @@ be found in the [examples package](https://github.com/dmlc/xgboost/tree/master/j
**NOTE on LIBSVM Format**:
* Use *1-based* ascending indexes for the LIBSVM format in distributed training mode
There is an inconsistency between XGBoost4J-Spark and other language bindings of XGBoost.
* Spark does the internal conversion, and does not accept formats that are 0-based
When users use Spark to load a training set or test set in LIBSVM format with the following code snippet:
* Whereas, use *0-based* indexes format when predicting in normal mode - for instance, while using the saved model in the Python package
```scala
spark.read.format("libsvm").load("trainingset_libsvm")
```
Spark assumes that the dataset is 1-based indexed. However, when you do prediction with other bindings of XGBoost (e.g. the Python API of XGBoost), XGBoost assumes that the dataset is 0-based indexed. This creates a pitfall for users who train a model with Spark but predict with a dataset in the same format in other bindings of XGBoost.
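A hedged sketch of the pitfall in Java (the file name is made up; the reader call is the same Spark ``libsvm`` source shown above):

```java
// Suppose train_libsvm contains the line: 1 3:1.0
Dataset<Row> df = spark.read().format("libsvm").load("train_libsvm");
// Spark treats index 3 as 1-based, so the value lands in feature column 2.
// The Python binding reading the same file treats 3 as 0-based (column 3),
// so every feature is shifted by one at prediction time.
```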