fixed some typos (#1814)
Committed by: Yuan (Terry) Tang
Parent: be2f28ec08
Commit: da2556f58a
@@ -17,7 +17,7 @@ To publish the artifacts to your local maven repository, run

 mvn install

-Or, if you would like to skip tests, run
+Or, if you would like to skip tests, run

 mvn -DskipTests install

@@ -32,7 +32,7 @@ This command will publish the xgboost binaries, the compiled java classes as wel

-After integrating with Dataframe/Dataset APIs of Spark 2.0, XGBoost4J-Spark only supports compile with Spark 2.x. You can build XGBoost4J-Spark as a component of XGBoost4J by running `mvn package`, and you can specify the version of spark with `mvn -Dspark.version=2.0.0 package`. (To continue working with Spark 1.x, the users are supposed to update pom.xml by modifying the properties like `spark.version`, `scala.version`, and `scala.binary.version`. Users also need to change the implemention by replacing SparkSession with SQLContext and the type of API parameters from Dataset[_] to Dataframe)
+After integrating with Dataframe/Dataset APIs of Spark 2.0, XGBoost4J-Spark only supports compile with Spark 2.x. You can build XGBoost4J-Spark as a component of XGBoost4J by running `mvn package`, and you can specify the version of spark with `mvn -Dspark.version=2.0.0 package`. (To continue working with Spark 1.x, the users are supposed to update pom.xml by modifying the properties like `spark.version`, `scala.version`, and `scala.binary.version`. Users also need to change the implementation by replacing SparkSession with SQLContext and the type of API parameters from Dataset[_] to Dataframe)

 Contents
 --------

@@ -133,7 +133,7 @@ Booster booster = new Booster(param, "model.bin");
 ```

 ## Prediction
-after training and loading a model, you use it to predict other data, the predict results will be a two-dimension float array (nsample, nclass) ,for predict leaf, it would be (nsample, nclass*ntrees)
+after training and loading a model, you use it to predict other data, the predict results will be a two-dimension float array (nsample, nclass), for predict leaf, it would be (nsample, nclass*ntrees)
 ```java
 DMatrix dtest = new DMatrix("test.svm.txt");
 //predict

@@ -26,7 +26,7 @@ They are also often [much more efficient](http://arxiv.org/abs/1603.02754).

 The gap between the implementation fundamentals of the general data processing frameworks and the more specific machine learning libraries/systems prohibits the smooth connection between these two types of systems, thus brings unnecessary inconvenience to the end user. The common workflow to the user is to utilize the systems like Spark/Flink to preprocess/clean data, pass the results to machine learning systems like [XGBoost](https://github.com/dmlc/xgboost)/[MxNet](https://github.com/dmlc/mxnet)) via the file systems and then conduct the following machine learning phase. This process jumping across two types of systems creates certain inconvenience for the users and brings additional overhead to the operators of the infrastructure.

-We want best of both worlds, so we can use the data processing frameworks like Spark and Flink toghether with
+We want best of both worlds, so we can use the data processing frameworks like Spark and Flink together with
 the best distributed machine learning solutions.
 To resolve the situation, we introduce the new-brewed [XGBoost4J](https://github.com/dmlc/xgboost/tree/master/jvm-packages),
 <b>XGBoost</b> for <b>J</b>VM Platform. We aim to provide the clean Java/Scala APIs and the integration with the most popular data processing systems developed in JVM-based languages.

@@ -1,6 +1,6 @@
-## Introduction
+## Introduction

-On March 2016, we released the first version of [XGBoost4J](http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html), which is a set of packages providing Java/Scala interfaces of XGBoost and the integration with prevalent JVM-based distributed data processing platforms, like Spark/Flink.
+On March 2016, we released the first version of [XGBoost4J](http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html), which is a set of packages providing Java/Scala interfaces of XGBoost and the integration with prevalent JVM-based distributed data processing platforms, like Spark/Flink.

 The integrations with Spark/Flink, a.k.a. <b>XGBoost4J-Spark</b> and <b>XGBoost-Flink</b>, receive the tremendous positive feedbacks from the community. It enables users to build a unified pipeline, embedding XGBoost into the data processing system based on the widely-deployed frameworks like Spark. The following figure shows the general architecture of such a pipeline with the first version of <b>XGBoost4J-Spark</b>, where the data processing is based on the low-level [Resilient Distributed Dataset (RDD)](http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds) abstraction.

@@ -12,14 +12,14 @@ In the last months, we have a lot of communication with the users and gain the d

 * While Spark is still the mainstream data processing tool in most of scenarios, more and more users are porting their RDD-based Spark programs to [DataFrame/Dataset APIs](http://spark.apache.org/docs/latest/sql-programming-guide.html) for the well-designed interfaces to manipulate structured data and the [significant performance improvement](https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html).

-* Spark itself has presented a clear roadmap that DataFrame/Dataset would be the base of the latest and future features, e.g. latest version of [ML pipeline](http://spark.apache.org/docs/latest/ml-guide.html) and [Structured Streaming](http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html).
+* Spark itself has presented a clear roadmap that DataFrame/Dataset would be the base of the latest and future features, e.g. latest version of [ML pipeline](http://spark.apache.org/docs/latest/ml-guide.html) and [Structured Streaming](http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html).

-Based on these feedbacks from the users, we observe a gap between the original RDD-based XGBoost4J-Spark and the users' latest usage scenario as well as the future direction of Spark ecosystem. To fill this gap, we start working on the <b><i>integration of XGBoost and Spark's DataFrame/Dataset abstraction</i></b> in September. In this blog, we will introduce <b>the latest version of XGBoost4J-Spark</b> which allows the user to work with DataFrame/Dataset directly and embed XGBoost to Spark's ML pipeline seamlessly.
+Based on these feedbacks from the users, we observe a gap between the original RDD-based XGBoost4J-Spark and the users' latest usage scenario as well as the future direction of Spark ecosystem. To fill this gap, we start working on the <b><i>integration of XGBoost and Spark's DataFrame/Dataset abstraction</i></b> in September. In this blog, we will introduce <b>the latest version of XGBoost4J-Spark</b> which allows the user to work with DataFrame/Dataset directly and embed XGBoost to Spark's ML pipeline seamlessly.


 ## A Full Integration of XGBoost and DataFrame/Dataset

-The following figure illustrates the new pipeline architecture with the latest XGBoost4J-Spark.
+The following figure illustrates the new pipeline architecture with the latest XGBoost4J-Spark.

 

@@ -49,7 +49,7 @@ import org.apache.spark.ml.feature.StringIndexer
 // load sales records saved in json files
 val salesDF = spark.read.json("sales.json")

-// transfrom the string-represented storeType feature to numeric storeTypeIndex
+// transform the string-represented storeType feature to numeric storeTypeIndex
 val indexer = new StringIndexer()
 .setInputCol("storeType")
 .setOutputCol("storeTypeIndex")
@@ -71,7 +71,7 @@ import org.apache.spark.ml.feature.StringIndexer
 // load sales records saved in json files
 val salesDF = spark.read.json("sales.json")

-// transfrom the string-represented storeType feature to numeric storeTypeIndex
+// transform the string-represented storeType feature to numeric storeTypeIndex
 val indexer = new StringIndexer()
 .setInputCol("storeType")
 .setOutputCol("storeTypeIndex")
@@ -99,7 +99,7 @@ val salesRecordsWithPred = xgboostModel.transform(salesTestDF)
 The most critical operation to maximize the power of XGBoost is to select the optimal parameters for the model. Tuning parameters manually is a tedious and labor-consuming process. With the latest version of XGBoost4J-Spark, we can utilize the Spark model selecting tool to automate this process. The following example shows the code snippet utilizing [TrainValidationSplit](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.tuning.TrainValidationSplit) and [RegressionEvaluator](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.RegressionEvaluator) to search the optimal combination of two XGBoost parameters, [max_depth and eta] (https://github.com/dmlc/xgboost/blob/master/doc/parameter.md). The model producing the minimum cost function value defined by RegressionEvaluator is selected and used to generate the prediction for the test set.

 ```scala
-// create XGBoostEstimator
+// create XGBoostEstimator
 val xgbEstimator = new XGBoostEstimator(xgboostParam).setFeaturesCol("features").
 setLabelCol("sales")
 val paramGrid = new ParamGridBuilder()
@@ -137,5 +137,3 @@ If you are interested in knowing more about XGBoost, you can find rich resources
 - [Tutorials for the R package](xgboost.readthedocs.org/en/latest/R-package/index.html)
 - [Introduction of the Parameters](http://xgboost.readthedocs.org/en/latest/parameter.html)
 - [Awesome XGBoost, a curated list of examples, tutorials, blogs about XGBoost usecases](https://github.com/dmlc/xgboost/tree/master/demo)
-
-