Backport doc fixes that are compatible with 0.72 release

* Clarify behavior of LIBSVM in XGBoost4J-Spark (#3524)
* Fix typo in faq.rst (#3521)
* Fix typo in parameter.rst, gblinear section (#3518)
* Clarify supported OSes for XGBoost4J published JARs (#3547)
* Update broken links (#3565)
* Grammar fixes and typos (#3568)
* Bring XGBoost4J Intro up-to-date (#3574)
Nan Zhu authored 2018-07-28 17:34:39 -07:00, committed by Philip Cho
parent e19dded9a3
commit 4334b9cc91
6 changed files with 77 additions and 80 deletions


@@ -7,7 +7,7 @@ This document contains frequently asked questions about XGBoost.
**********************
How to tune parameters
**********************
See :doc:`Parameter Tunning Guide </tutorials/param_tuning>`.
See :doc:`Parameter Tuning Guide </tutorials/param_tuning>`.
************************
Description on the model


@@ -56,6 +56,13 @@ For sbt, please add the repository and dependency in build.sbt as following:
"ml.dmlc" % "xgboost4j" % "latest_source_version_num"
If you want to use XGBoost4J-Spark, replace ``xgboost4j`` with ``xgboost4j-spark``.
.. note:: Spark 2.0 Required
After integrating with the Dataframe/Dataset APIs of Spark 2.0, XGBoost4J-Spark only supports compilation against Spark 2.x. You can build XGBoost4J-Spark as a component of XGBoost4J by running ``mvn package``, and you can specify the Spark version with ``mvn -Dspark.version=2.0.0 package``. (To continue working with Spark 1.x, update pom.xml by modifying properties such as ``spark.version``, ``scala.version``, and ``scala.binary.version``. You will also need to change the implementation by replacing ``SparkSession`` with ``SQLContext`` and changing the type of API parameters from ``Dataset[_]`` to ``Dataframe``.)
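As a hedged illustration of the API swap described in the note (the class names are standard Spark, and the file name is made up for this sketch):

.. code-block:: java

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Spark 2.x entry point that XGBoost4J-Spark compiles against.
SparkSession spark = SparkSession.builder().appName("xgboost4j-spark-demo").getOrCreate();
Dataset<Row> df = spark.read().format("libsvm").load("train_libsvm");
// On Spark 1.x, you would construct a SQLContext instead, and the
// XGBoost4J-Spark APIs would take DataFrame rather than Dataset[_].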
Installation from maven repo
============================
@@ -76,9 +83,11 @@ Access release version
"ml.dmlc" % "xgboost4j" % "latest_version_num"
This will checkout the latest stable version from the Maven Central.
For the latest release version number, please check `here <https://github.com/dmlc/xgboost/releases>`_.
if you want to use XGBoost4J-Spark, you just need to replace ``xgboost4j`` with ``xgboost4j-spark``.
If you want to use XGBoost4J-Spark, replace ``xgboost4j`` with ``xgboost4j-spark``.
Access SNAPSHOT version
-----------------------
@@ -117,9 +126,9 @@ Then add dependency as following:
For the latest release version number, please check `here <https://github.com/CodingCat/xgboost/tree/maven-repo/ml/dmlc/xgboost4j>`_.
If you want to use XGBoost4J-Spark, you just need to replace ``xgboost4j`` with ``xgboost4j-spark``.
.. note:: Windows not supported by published JARs
After integrating with Dataframe/Dataset APIs of Spark 2.0, XGBoost4J-Spark only supports compile with Spark 2.x. You can build XGBoost4J-Spark as a component of XGBoost4J by running ``mvn package``, and you can specify the version of spark with ``mvn -Dspark.version=2.0.0 package``. (To continue working with Spark 1.x, the users are supposed to update pom.xml by modifying the properties like ``spark.version``, ``scala.version``, and ``scala.binary.version``. Users also need to change the implementation by replacing ``SparkSession`` with ``SQLContext`` and the type of API parameters from ``Dataset[_]`` to ``Dataframe``)
The published JARs from Maven Central and GitHub currently only support Linux and MacOS. Windows users should consider building XGBoost4J / XGBoost4J-Spark from source. Alternatively, check out pre-built JARs from `criteo-forks/xgboost-jars <https://github.com/criteo-forks/xgboost-jars>`_.
Enabling OpenMP for Mac OS
--------------------------
@@ -136,8 +145,9 @@ Contents
********
.. toctree::
:maxdepth: 2
Java Overview Tutorial <java_intro>
java_intro
Code Examples <https://github.com/dmlc/xgboost/tree/master/jvm-packages/xgboost4j-example>
XGBoost4J Java API <http://dmlc.ml/docs/javadocs/index.html>
XGBoost4J Scala API <http://dmlc.ml/docs/scaladocs/xgboost4j/index.html>


@@ -1,28 +1,28 @@
##################
XGBoost4J Java API
##################
##############################
Getting Started with XGBoost4J
##############################
This tutorial introduces the Java API for XGBoost.
**************
Data Interface
**************
Like the XGBoost python module, XGBoost4J uses ``DMatrix`` to handle data,
libsvm txt format file, sparse matrix in CSR/CSC format, and dense matrix is
Like the XGBoost Python module, XGBoost4J uses DMatrix to handle data.
LIBSVM txt format file, sparse matrix in CSR/CSC format, and dense matrix are
supported.
* The first step is to import ``DMatrix``:
* The first step is to import DMatrix:
.. code-block:: java
import org.dmlc.xgboost4j.DMatrix;
import ml.dmlc.xgboost4j.java.DMatrix;
* Use ``DMatrix`` constructor to load data from a libsvm text format file:
* Use DMatrix constructor to load data from a libsvm text format file:
.. code-block:: java
DMatrix dmat = new DMatrix("train.svm.txt");
* Pass arrays to ``DMatrix`` constructor to load from sparse matrix.
* Pass arrays to DMatrix constructor to load from sparse matrix.
Suppose we have a sparse matrix
@@ -39,7 +39,8 @@ supported.
long[] rowHeaders = new long[] {0,2,4,7};
float[] data = new float[] {1f,2f,4f,3f,3f,1f,2f};
int[] colIndex = new int[] {0,2,0,3,0,1,2};
DMatrix dmat = new DMatrix(rowHeaders, colIndex, data, DMatrix.SparseType.CSR);
int numColumn = 4;
DMatrix dmat = new DMatrix(rowHeaders, colIndex, data, DMatrix.SparseType.CSR, numColumn);
... or in `Compressed Sparse Column (CSC) <https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_column_(CSC_or_CCS)>`_ format:
@@ -48,7 +49,8 @@ supported.
long[] colHeaders = new long[] {0,3,4,6,7};
float[] data = new float[] {1f,4f,3f,1f,2f,2f,3f};
int[] rowIndex = new int[] {0,1,2,2,0,2,1};
DMatrix dmat = new DMatrix(colHeaders, rowIndex, data, DMatrix.SparseType.CSC);
int numRow = 3;
DMatrix dmat = new DMatrix(colHeaders, rowIndex, data, DMatrix.SparseType.CSC, numRow);
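Both snippets above encode the same underlying matrix. As a sketch (the dense values below are reconstructed from the CSR arrays, not stated in this hunk), decoding them back into dense form:

.. code-block:: java

import java.util.Arrays;

// Decode the CSR arrays above into the dense 3x4 matrix they represent.
long[] rowHeaders = new long[] {0,2,4,7};
float[] data = new float[] {1f,2f,4f,3f,3f,1f,2f};
int[] colIndex = new int[] {0,2,0,3,0,1,2};
int numColumn = 4;
float[][] dense = new float[rowHeaders.length - 1][numColumn];
for (int row = 0; row < dense.length; row++) {
  // Entries of row i sit at positions rowHeaders[i] .. rowHeaders[i+1]-1.
  for (long k = rowHeaders[row]; k < rowHeaders[row + 1]; k++) {
    dense[row][colIndex[(int) k]] = data[(int) k];
  }
}
for (float[] r : dense) {
  System.out.println(Arrays.toString(r));  // [1,0,2,0], [4,0,0,3], [3,1,2,0]
}

The CSC arrays describe the same matrix column by column: ``colHeaders[j] .. colHeaders[j+1]-1`` delimit column ``j``'s entries in ``data``/``rowIndex``.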
* You may also load your data from a dense matrix. Let's assume we have a matrix of form
@@ -66,7 +68,7 @@ supported.
int nrow = 3;
int ncol = 2;
float missing = 0.0f;
DMatrix dmat = new Matrix(data, nrow, ncol, missing);
DMatrix dmat = new DMatrix(data, nrow, ncol, missing);
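The ``data`` array itself sits outside this hunk; assuming the 3x2 matrix is stored row-major, it would look like the following (values are illustrative):

.. code-block:: java

// Row-major layout of a 3x2 matrix with rows {1,2}, {3,4}, {5,6}:
float[] data = new float[] {1f, 2f, 3f, 4f, 5f, 6f};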
* To set weight:
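The weight snippet itself falls outside this hunk; a minimal sketch, assuming the ``DMatrix.setWeight`` API (one float per instance):

.. code-block:: java

float[] weights = new float[] {1f, 2f, 1f};  // one weight per row of dmat (illustrative values)
dmat.setWeight(weights);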
@@ -78,47 +80,31 @@ supported.
******************
Setting Parameters
******************
* In XGBoost4J any ``Iterable<Entry<String, Object>>`` object could be used as parameters.
To set parameters, specify them as a Map:
* To set parameters, for non-multiple value params, you can simply use entrySet of an Map:
.. code-block:: java
.. code-block:: java
Map<String, Object> paramMap = new HashMap<>() {
{
put("eta", 1.0);
put("max_depth", 2);
put("silent", 1);
put("objective", "binary:logistic");
put("eval_metric", "logloss");
}
};
Iterable<Entry<String, Object>> params = paramMap.entrySet();
* for the situation that multiple values with same param key, List<Entry<String, Object>> would be a good choice, e.g. :
.. code-block:: java
List<Entry<String, Object>> params = new ArrayList<Entry<String, Object>>() {
{
add(new SimpleEntry<String, Object>("eta", 1.0));
add(new SimpleEntry<String, Object>("max_depth", 2.0));
add(new SimpleEntry<String, Object>("silent", 1));
add(new SimpleEntry<String, Object>("objective", "binary:logistic"));
}
};
Map<String, Object> params = new HashMap<String, Object>() {
{
put("eta", 1.0);
put("max_depth", 2);
put("silent", 1);
put("objective", "binary:logistic");
put("eval_metric", "logloss");
}
};
**************
Training Model
**************
With parameters and data, you are able to train a booster model.
* Import ``Trainer`` and ``Booster``:
* Import Booster and XGBoost:
.. code-block:: java
import org.dmlc.xgboost4j.Booster;
import org.dmlc.xgboost4j.util.Trainer;
import ml.dmlc.xgboost4j.java.Booster;
import ml.dmlc.xgboost4j.java.XGBoost;
* Training
@@ -126,13 +112,15 @@ With parameters and data, you are able to train a booster model.
DMatrix trainMat = new DMatrix("train.svm.txt");
DMatrix validMat = new DMatrix("valid.svm.txt");
//specify a watchList to see the performance
//any Iterable<Entry<String, DMatrix>> object could be used as watchList
List<Entry<String, DMatrix>> watchs = new ArrayList<>();
watchs.add(new SimpleEntry<>("train", trainMat));
watchs.add(new SimpleEntry<>("test", testMat));
int round = 2;
Booster booster = Trainer.train(params, trainMat, round, watchs, null, null);
// Specify a watch list to see model accuracy on data sets
Map<String, DMatrix> watches = new HashMap<String, DMatrix>() {
{
put("train", trainMat);
put("test", testMat);
}
};
int nround = 2;
Booster booster = XGBoost.train(trainMat, params, nround, watches, null, null);
* Saving model
@@ -142,25 +130,20 @@ With parameters and data, you are able to train a booster model.
booster.saveModel("model.bin");
* Dump Model and Feature Map
* Generating model dump with feature map
.. code-block:: java
booster.dumpModel("modelInfo.txt", false)
//dump with featureMap
booster.dumpModel("modelInfo.txt", "featureMap.txt", false)
// dump without feature map
String[] model_dump = booster.getModelDump(null, false);
// dump with feature map
String[] model_dump_with_feature_map = booster.getModelDump("featureMap.txt", false);
* Load a model
.. code-block:: java
Params param = new Params() {
{
put("silent", 1);
put("nthread", 6);
}
};
Booster booster = new Booster(param, "model.bin");
Booster booster = XGBoost.loadModel("model.bin");
**********
Prediction
@@ -170,8 +153,8 @@ After training and loading a model, you can use it to make prediction for other
.. code-block:: java
DMatrix dtest = new DMatrix("test.svm.txt");
//predict
// predict
float[][] predicts = booster.predict(dtest);
//predict leaf
float[][] leafPredicts = booster.predict(dtest, 0, true);
// predict leaf
float[][] leafPredicts = booster.predictLeaf(dtest, 0);


@@ -211,7 +211,7 @@ Additional parameters for Dart Booster (``booster=dart``)
- range: [0.0, 1.0]
Parameters for Linear Booster (``booster=gbtree``)
Parameters for Linear Booster (``booster=gblinear``)
==================================================
* ``lambda`` [default=0, alias: ``reg_lambda``]
@@ -280,7 +280,7 @@ Specify the learning task and the corresponding learning objective. The objectiv
- ``error``: Binary classification error rate. It is calculated as ``#(wrong cases)/#(all cases)``. For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
- ``error@t``: a binary classification threshold different from 0.5 can be specified by providing a numerical value through 't'.
- ``merror``: Multiclass classification error rate. It is calculated as ``#(wrong cases)/#(all cases)``.
- ``mlogloss``: `Multiclass logloss <https://www.kaggle.com/wiki/LogLoss>`_.
- ``mlogloss``: `Multiclass logloss <http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html>`_.
- ``auc``: `Area under the curve <http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_curve>`_
- ``ndcg``: `Normalized Discounted Cumulative Gain <http://en.wikipedia.org/wiki/NDCG>`_
- ``map``: `Mean average precision <http://en.wikipedia.org/wiki/Mean_average_precision#Mean_average_precision>`_
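As a hedged example of selecting one of these metrics from XGBoost4J (the values are illustrative), including a non-default threshold via ``@t``:

.. code-block:: java

import java.util.HashMap;
import java.util.Map;

Map<String, Object> params = new HashMap<String, Object>() {
  {
    put("objective", "binary:logistic");
    // Report classification error, counting a prediction as positive
    // only when it exceeds 0.7 rather than the default 0.5.
    put("eval_metric", "error@0.7");
  }
};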


@@ -19,7 +19,7 @@ However, such complicated model requires more data to fit.
Most of the parameters in XGBoost are about the bias-variance tradeoff. The best model
should carefully trade model complexity against predictive power.
:doc:`Parameters Documentation </parameter>` will tell you whether each parameter
ill make the model more conservative or not. This can be used to help you
will make the model more conservative or not. This can be used to help you
turn the knob between complicated model and simple model.
*******************
@@ -27,16 +27,16 @@ Control Overfitting
*******************
When you observe high training accuracy but low test accuracy, it is likely that you have encountered an overfitting problem.
There are in general two ways that you can control overfitting in XGBoost
There are in general two ways that you can control overfitting in XGBoost:
* The first way is to directly control model complexity
* The first way is to directly control model complexity.
- This include ``max_depth``, ``min_child_weight`` and ``gamma``
- This includes ``max_depth``, ``min_child_weight`` and ``gamma``.
* The second way is to add randomness to make training robust to noise
* The second way is to add randomness to make training robust to noise.
- This include ``subsample`` and ``colsample_bytree``.
- You can also reduce stepsize ``eta``. Rremember to increase ``num_round`` when you do so.
- This includes ``subsample`` and ``colsample_bytree``.
- You can also reduce stepsize ``eta``. Remember to increase ``num_round`` when you do so.
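As a sketch only (the numbers are illustrative, not tuned recommendations), the two approaches map onto XGBoost4J parameters like this:

.. code-block:: java

import java.util.HashMap;
import java.util.Map;

Map<String, Object> params = new HashMap<String, Object>() {
  {
    // 1. Directly control model complexity.
    put("max_depth", 4);
    put("min_child_weight", 5);
    put("gamma", 1.0);
    // 2. Add randomness to make training robust to noise.
    put("subsample", 0.8);
    put("colsample_bytree", 0.8);
    put("eta", 0.1);  // smaller step size; raise num_round to compensate
  }
};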
*************************
Handle Imbalanced Dataset


@@ -68,8 +68,12 @@ be found in the [examples package](https://github.com/dmlc/xgboost/tree/master/j
**NOTE on LIBSVM Format**:
* Use *1-based* ascending indexes for the LIBSVM format in distributed training mode
There is an inconsistency between XGBoost4J-Spark and other language bindings of XGBoost.
* Spark does the internal conversion, and does not accept formats that are 0-based
When users use Spark to load a training set or test set in LIBSVM format with the following code snippet:
* Whereas, use *0-based* indexes format when predicting in normal mode - for instance, while using the saved model in the Python package
```scala
spark.read.format("libsvm").load("trainingset_libsvm")
```
Spark assumes that the dataset is 1-based indexed. However, when you do prediction with other bindings of XGBoost (e.g. the Python API of XGBoost), XGBoost assumes that the dataset is 0-based indexed. This creates a pitfall for users who train a model with Spark but predict with a dataset in the same format in other bindings of XGBoost.
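A hedged sketch of the pitfall in Java (the file name is made up; the reader call is the same Spark ``libsvm`` source shown above):

```java
// Suppose train_libsvm contains the line: 1 3:1.0
Dataset<Row> df = spark.read().format("libsvm").load("train_libsvm");
// Spark treats index 3 as 1-based, so the value lands in feature column 2.
// The Python binding reading the same file treats 3 as 0-based (column 3),
// so every feature is shifted by one at prediction time.
```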