diff --git a/doc/jvm/xgboost4j-intro.md b/doc/jvm/xgboost4j-intro.md
new file mode 100644
index 000000000..d8bdb015a
--- /dev/null
+++ b/doc/jvm/xgboost4j-intro.md
@@ -0,0 +1,209 @@
+---
+layout: post
+title: "XGBoost4J Package Released"
+date: 2016-03-15 12:00:00
+author: Nan Zhu, Tianqi Chen
+categories: rstats
+comments: true
+---
+
+# Introduction
+
+[XGBoost](https://github.com/dmlc/xgboost) is a library designed and optimized for tree boosting algorithms. The gradient boosted trees model was originally proposed by Friedman. By embracing multi-threading and introducing regularization, XGBoost delivers higher computational efficiency and more accurate predictions. More than half of the winning solutions in machine learning challenges hosted at Kaggle have adopted XGBoost ([incomplete list](https://github.com/dmlc/xgboost/tree/master/demo#machine-learning-challenge-winning-solutions)). Until three weeks ago, XGBoost provided C++, R, Python, Julia, and Java interfaces for its various target user groups.
+
+We started the [xgboost4j](https://github.com/dmlc/xgboost/tree/master/jvm-packages) (XGBoost for JVM) project three weeks ago, covering a new design and implementation of the Java/Scala interface as well as integration with dataflow frameworks. Today, we are happy to announce the availability of the first version of XGBoost4J. In this post, we give a brief introduction to this new XGBoost package.
+
+# Motivation
+
+Programming languages and data processing/storage systems based on the Java Virtual Machine (JVM) play significant roles in the big data ecosystem. [Hadoop](http://hadoop.apache.org/) and [Spark](http://spark.apache.org/), which hold the majority of the market share among general large-scale data processing systems, are both implemented in JVM languages. On the other side, many machine learning libraries/systems (e.g. [XGBoost](https://github.com/dmlc/xgboost) and [MxNet](https://github.com/dmlc/mxnet)) that exhibit excellent performance in various scenarios are implemented in more "native" programming languages such as C++.
+
+The gap between the implementation fundamentals of general data processing frameworks and the more specialized machine learning libraries/systems prevents a smooth connection between these two types of systems and brings unnecessary inconvenience to the end user. A common workflow is to use a system like Spark to preprocess and clean the data, pass the results to a machine learning system like [XGBoost](https://github.com/dmlc/xgboost) or [MxNet](https://github.com/dmlc/mxnet) via the file system, and then run the machine learning phase. Whenever the data format changes or new features are tried, the user has to walk through this process time and time again.
+
+To resolve this situation, we introduce the newly brewed XGBoost4J, XGBoost for the JVM platform. We aim to provide clean Java/Scala APIs and integration with the most popular data processing systems developed in JVM-based languages.
+
+# System Overview
+
+In the following figure, we describe the overall architecture of XGBoost4J.
+XGBoost4J provides Java/Scala APIs that wrap the core functionality of the XGBoost library. Most importantly, it not only supports single-machine model training but also provides an abstraction layer that masks the differences among the underlying data processing engines (Spark, Flink, or simply a set of distributed servers across a cluster).
+
+![XGBoost4J Architecture](https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/xgboost4j.png)
+
+By calling the XGBoost4J API, users can scale model training to a cluster. XGBoost4J wraps the running instances of XGBoost in Spark/Flink tasks and runs them across the cluster. The communication among the distributed model training tasks and the XGBoost4J runtime environment goes through [Rabit](https://github.com/dmlc/rabit).
+
+With the abstraction of XGBoost4J, users can build a unified data analytics application ranging from Extract-Transform-Load (ETL) and data exploration to machine learning model training and the final data product service. The following figure illustrates an example application built on top of Apache Spark. The application seamlessly embeds XGBoost into the processing pipeline and exchanges data with the other Spark-based processing phases through Spark's distributed memory layer.
+
+![Unified Pipeline](https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/unified_pipeline.png)
+
+# Walk-through
+
+In this section, we walk through the APIs of XGBoost4J by example, covering the single-machine as well as the distributed APIs.
+
+#### Single-machine Training
+
+To start model training and evaluation, we first prepare the training and test sets.
+
+In Java, we do:
+
+```java
+import ml.dmlc.xgboost4j.java.DMatrix;
+
+// load data from a text file, or from a binary buffer generated by xgboost4j
+DMatrix trainMat = new DMatrix("../../demo/data/agaricus.txt.train");
+DMatrix testMat = new DMatrix("../../demo/data/agaricus.txt.test");
+```
+
+Or in Scala:
+
+```scala
+import ml.dmlc.xgboost4j.scala.DMatrix
+
+val trainMat = new DMatrix("../../demo/data/agaricus.txt.train")
+val testMat = new DMatrix("../../demo/data/agaricus.txt.test")
+```
+
+After preparing the data, we can train our model.
+
+In Java:
+
+```java
+import java.util.HashMap;
+import ml.dmlc.xgboost4j.java.Booster;
+import ml.dmlc.xgboost4j.java.XGBoost;
+
+// training parameters
+HashMap<String, Object> params = new HashMap<>();
+params.put("eta", 1.0);
+params.put("max_depth", 2);
+params.put("silent", 1);
+params.put("objective", "binary:logistic");
+
+// datasets to be evaluated in each boosting round
+HashMap<String, DMatrix> watches = new HashMap<>();
+watches.put("train", trainMat);
+watches.put("test", testMat);
+
+// set the number of boosting rounds
+int round = 2;
+
+// train a boosting model
+Booster booster = XGBoost.train(params, trainMat, round, watches, null, null);
+```
+
+In Scala:
+
+```scala
+import scala.collection.mutable
+import ml.dmlc.xgboost4j.scala.XGBoost
+
+// training parameters
+val params = new mutable.HashMap[String, Any]()
+params += "eta" -> 1.0
+params += "max_depth" -> 2
+params += "silent" -> 1
+params += "objective" -> "binary:logistic"
+
+// datasets to be evaluated in each boosting round
+val watches = new mutable.HashMap[String, DMatrix]()
+watches += "train" -> trainMat
+watches += "test" -> testMat
+
+val round = 2
+// train a model
+val booster = XGBoost.train(params.toMap, trainMat, round, watches.toMap)
+```
+
+With the booster obtained in either Java or Scala, we can evaluate the model on our test set.
+
+In Java:
+
+```java
+float[][] predicts = booster.predict(testMat);
+```
+
+In Scala:
+
+```scala
+val predicts = booster.predict(testMat)
+```
+
+`predict` outputs the prediction results, and you can define a customized evaluation method to derive your own metrics (see the examples in [Customized Evaluation Metric in Java](https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-example/src/main/java/ml/dmlc/xgboost4j/java/example/CustomObjective.java) and [Customized Evaluation Metric in Scala](https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-example/src/main/scala/ml/dmlc/xgboost4j/scala/example/CustomObjective.scala)).
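+
+As a concrete sketch of what such a customized metric can look like, the following Scala class computes a simple classification error. It is modeled on the linked examples and assumes the `EvalTrait` interface of the Scala package; consult the linked sources for the authoritative signatures.
+
+```scala
+import ml.dmlc.xgboost4j.scala.{DMatrix, EvalTrait}
+
+// a customized metric: the fraction of examples whose predicted
+// probability falls on the wrong side of the 0.5 threshold
+class CustomError extends EvalTrait {
+
+  // the metric name as it appears in the evaluation logs
+  override def getMetric: String = "custom_error"
+
+  override def eval(predicts: Array[Array[Float]], dmat: DMatrix): Float = {
+    val labels = dmat.getLabel
+    var errors = 0
+    for (i <- labels.indices) {
+      val predictedLabel = if (predicts(i)(0) > 0.5f) 1.0f else 0.0f
+      if (predictedLabel != labels(i)) errors += 1
+    }
+    errors.toFloat / labels.length
+  }
+}
+```
+
+An instance of such a class can then be supplied to `XGBoost.train` in place of the trailing `null` evaluation argument shown in the Java training example above.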
+
+#### Distributed Model Training with Distributed Dataflow Frameworks
+
+The most exciting part of this XGBoost4J release is the integration with distributed dataflow frameworks. The most popular data processing frameworks fall into this category, e.g. [Apache Spark](http://spark.apache.org/) and [Apache Flink](http://flink.apache.org/). In this part, we walk through the steps to build a unified data analytics application containing data preprocessing and distributed model training with Spark and Flink. (Currently, we only provide Scala APIs for the integration with Spark and Flink.)
+
+Similar to single-machine training, we need to prepare the training and test datasets.
+
+In Spark, the dataset is represented as a [Resilient Distributed Dataset (RDD)](http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds); we can utilize Spark's distributed tools to parse a libSVM file and wrap it as an RDD:
+
+```scala
+val trainRDD = MLUtils.loadLibSVMFile(sc, inputTrainPath).repartition(args(1).toInt)
+```
+
+In Flink, we do the same and represent the training data as Flink's [DataSet](https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html):
+
+```scala
+val trainData = MLUtils.readLibSVM(env, "/path/to/data/agaricus.txt.train")
+```
+
+We then move forward to train the models, where `paramMap` and `numRound` are the same kind of parameter map and number of boosting rounds as in the single-machine example. In Spark:
+
+```scala
+val xgboostModel = XGBoost.train(trainRDD, paramMap, numRound)
+```
+
+and in Flink:
+
+```scala
+val xgboostModel = XGBoost.train(trainData, paramMap, round)
+```
+
+The next step is to evaluate the model. You can predict either locally or in a distributed fashion.
+
+In Spark:
+
+```scala
+// testSet is an RDD containing test set data represented as
+// org.apache.spark.mllib.regression.LabeledPoint
+val testSet = MLUtils.loadLibSVMFile(sc, inputTestPath)
+
+// local prediction: import the methods in DataUtils that automatically convert
+// Iterator[org.apache.spark.mllib.regression.LabeledPoint]
+// to Iterator[ml.dmlc.xgboost4j.LabeledPoint]
+import DataUtils._
+xgboostModel.predict(new DMatrix(testSet.collect().iterator))
+
+// distributed prediction
+xgboostModel.predict(testSet)
+```
+
+In Flink:
+
+```scala
+// testData is a DataSet containing test set data represented as
+// org.apache.flink.ml.common.LabeledVector
+val testData = MLUtils.readLibSVM(env, "/path/to/data/agaricus.txt.test")
+
+// local prediction
+xgboostModel.predict(testData.collect().iterator)
+
+// distributed prediction
+xgboostModel.predict(testData.map{x => x.vector})
+```
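+
+To see how these pieces fit into a single unified application, here is a minimal end-to-end sketch for Spark. The object name, input paths, and the exact import path of the Spark integration are illustrative assumptions; the XGBoost calls themselves are the ones introduced above.
+
+```scala
+import org.apache.spark.{SparkConf, SparkContext}
+import org.apache.spark.mllib.util.MLUtils
+// assumed import location of the Spark integration; check the
+// xgboost4j-spark package of the version you installed
+import ml.dmlc.xgboost4j.scala.spark.XGBoost
+
+object UnifiedPipelineSketch {
+  def main(args: Array[String]): Unit = {
+    val sc = new SparkContext(new SparkConf().setAppName("xgboost4j-spark-sketch"))
+
+    // ETL phase: parse the libSVM files into RDDs of LabeledPoint
+    val trainRDD = MLUtils.loadLibSVMFile(sc, "/path/to/data/agaricus.txt.train")
+    val testSet = MLUtils.loadLibSVMFile(sc, "/path/to/data/agaricus.txt.test")
+
+    // model training phase, with the same kind of parameter map as before
+    val paramMap = Map(
+      "eta" -> 1.0,
+      "max_depth" -> 2,
+      "objective" -> "binary:logistic")
+    val numRound = 2
+    val xgboostModel = XGBoost.train(trainRDD, paramMap, numRound)
+
+    // distributed prediction phase; downstream Spark stages can consume
+    // the resulting predictions through Spark's distributed memory layer
+    val predictions = xgboostModel.predict(testSet)
+
+    sc.stop()
+  }
+}
+```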
+
+# Road Map
+
+This is the first release of the XGBoost4J package, and we are actively working toward more exciting features for the next release. You can watch our progress on the [XGBoost4J Road Map](https://github.com/dmlc/xgboost/issues/935).
+
+# Further Readings
+
+If you are interested in knowing more about XGBoost, you can find rich resources in:
+
+- [The GitHub repository of XGBoost](https://github.com/dmlc/xgboost)
+- [The comprehensive documentation site for XGBoost](http://xgboost.readthedocs.org/en/latest/index.html)
+- [An introduction to the gradient boosting model](http://xgboost.readthedocs.org/en/latest/model.html)
+- [Tutorials for the R package](http://xgboost.readthedocs.org/en/latest/R-package/index.html)
+- [Introduction to the parameters](http://xgboost.readthedocs.org/en/latest/parameter.html)
+- [Awesome XGBoost, a curated list of examples, tutorials, and blogs about XGBoost use cases](https://github.com/dmlc/xgboost/tree/master/demo)
+
+# Acknowledgements
+
+We would like to send many thanks to [Zixuan Huang](https://github.com/yanqingmen), the early developer of XGBoost for Java.
\ No newline at end of file