Merge pull request #3 from CodingCat/fix_examples

adjust the API signature as well as the docs
This commit is contained in:
Tianqi Chen 2016-03-11 12:29:33 -08:00
commit 57987100bc
28 changed files with 118 additions and 95 deletions

View File

@ -24,7 +24,7 @@ Many of these machine learning libraries(e.g. [XGBoost](https://github.com/dmlc/
requires new computation abstraction and native support(e.g. C++ for GPU computing).
They are also often [much more efficient](http://arxiv.org/abs/1603.02754).
The gap between the implementation fundamentals of the general data processing frameworks and the more specific machine learning libraries/systems prohibits the smooth connection between these two types of systems, thus brings unnecessary inconvenience to the end user. The common workflow to the user is to utilize the systems like Flink/Spark to preprocess/clean data, pass the results to machine learning systems like [XGBoost](https://github.com/dmlc/xgboost)/[MxNet](https://github.com/dmlc/mxnet)) via the file system and then conduct the following machine learning phase. While such process won't hurt performance as much in data processing case(because machine learning takes a lot of time compared to data loading), it create a bit inconvenience for the users.
The gap between the implementation fundamentals of the general data processing frameworks and the more specific machine learning libraries/systems prohibits the smooth connection between these two types of systems, thus brings unnecessary inconvenience to the end user. The common workflow to the user is to utilize the systems like Flink/Spark to preprocess/clean data, pass the results to machine learning systems like [XGBoost](https://github.com/dmlc/xgboost)/[MxNet](https://github.com/dmlc/mxnet)) via the file system and then conduct the following machine learning phase. While such process won't hurt performance as much in data processing case(because machine learning takes a lot of time compared to data loading), it creates a bit inconvenience for the users.
We want best of both worlds, so we can use the data processing frameworks like Flink and Spark toghether with
the best distributed machine learning solutions.
@ -37,7 +37,7 @@ XGBoost and XGBoost4J adopts Unix Philosophy.
XGBoost **does its best in one thing -- tree boosting** and is **being designed to work with other systems**.
We strongly believe that machine learning solution should not be restricted to certain language or certain platform.
Specifically, users will be able to use distributed XGBoost in both Flink and Spark.
Specifically, users will be able to use distributed XGBoost in both Flink and Spark, and possibly more frameworks in Future.
We have made the API in a portable way so it **can be easily ported to other Dataflow frameworks provided by the Cloud**.
XGBoost4J shares its core with other XGBoost libraries, which means data scientists can use R/python
read and visualize the model trained distributedly.
@ -85,10 +85,10 @@ watches += "test" -> testMax
val round = 2
// train a model
val booster = XGBoost.train(params.toMap, trainMax, round, watches.toMap)
val booster = XGBoost.train(trainMax, params.toMap, round, watches.toMap)
```
In Scala:
We then evaluate our model:
```scala
val predicts = booster.predict(testMax)
@ -111,7 +111,7 @@ In Spark, the dataset is represented as the [Resilient Distributed Dataset (RDD)
val trainRDD = MLUtils.loadLibSVMFile(sc, inputTrainPath).repartition(args(1).toInt)
```
We move forward to train the models, in Spark:
We move forward to train the models:
```scala
val xgboostModel = XGBoost.train(trainRDD, paramMap, numRound)
@ -169,6 +169,8 @@ xgboostModel.predict(testData.map{x => x.vector})
It is the first release of XGBoost4J package, we are actively move forward for more charming features in the next release. You can watch our progress in [XGBoost4J Road Map](https://github.com/dmlc/xgboost/issues/935).
While we are trying our best to keep the minimum changes to the APIs, it is still subject to the incompatible changes.
## Further Readings
If you are interested in knowing more about XGBoost, you can find rich resources in

View File

@ -34,7 +34,7 @@ object XGBoostScalaExample {
// number of iterations
val round = 2
// train the model
val model = XGBoost.train(paramMap, trainData, round)
val model = XGBoost.train(trainData, paramMap, round)
// run prediction
val predTrain = model.predict(trainData)
// save model to the file.
@ -43,34 +43,6 @@ object XGBoostScalaExample {
}
```
### XGBoost Flink
```scala
import ml.dmlc.xgboost4j.scala.flink.XGBoost
import org.apache.flink.api.scala._
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.ml.MLUtils
object DistTrainWithFlink {
def main(args: Array[String]) {
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
// read trainining data
val trainData =
MLUtils.readLibSVM(env, "/path/to/data/agaricus.txt.train")
// define parameters
val paramMap = List(
"eta" -> 0.1,
"max_depth" -> 2,
"objective" -> "binary:logistic").toMap
// number of iterations
val round = 2
// train the model
val model = XGBoost.train(paramMap, trainData, round)
val predTrain = model.predict(trainData.map{x => x.vector})
model.saveModelToHadoop("file:///path/to/xgboost.model")
}
}
```
### XGBoost Spark
```scala
import org.apache.spark.SparkContext
@ -101,3 +73,33 @@ object DistTrainWithSpark {
}
}
```
### XGBoost Flink
```scala
import ml.dmlc.xgboost4j.scala.flink.XGBoost
import org.apache.flink.api.scala._
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.ml.MLUtils
object DistTrainWithFlink {
def main(args: Array[String]) {
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
// read trainining data
val trainData =
MLUtils.readLibSVM(env, "/path/to/data/agaricus.txt.train")
// define parameters
val paramMap = List(
"eta" -> 0.1,
"max_depth" -> 2,
"objective" -> "binary:logistic").toMap
// number of iterations
val round = 2
// train the model
val model = XGBoost.train(trainData, paramMap, round)
val predTrain = model.predict(trainData.map{x => x.vector})
model.saveModelToHadoop("file:///path/to/xgboost.model")
}
}
```

View File

@ -67,7 +67,7 @@ public class BasicWalkThrough {
int round = 2;
//train a boost model
Booster booster = XGBoost.train(params, trainMat, round, watches, null, null);
Booster booster = XGBoost.train(trainMat, params, round, watches, null, null);
//predict
float[][] predicts = booster.predict(testMat);
@ -111,7 +111,7 @@ public class BasicWalkThrough {
HashMap<String, DMatrix> watches2 = new HashMap<String, DMatrix>();
watches2.put("train", trainMat2);
watches2.put("test", testMat2);
Booster booster3 = XGBoost.train(params, trainMat2, round, watches2, null, null);
Booster booster3 = XGBoost.train(trainMat2, params, round, watches2, null, null);
float[][] predicts3 = booster3.predict(testMat2);
//check predicts

View File

@ -48,7 +48,7 @@ public class BoostFromPrediction {
watches.put("test", testMat);
//train xgboost for 1 round
Booster booster = XGBoost.train(params, trainMat, 1, watches, null, null);
Booster booster = XGBoost.train(trainMat, params, 1, watches, null, null);
float[][] trainPred = booster.predict(trainMat, true);
float[][] testPred = booster.predict(testMat, true);
@ -57,6 +57,6 @@ public class BoostFromPrediction {
testMat.setBaseMargin(testPred);
System.out.println("result of running from initial prediction");
Booster booster2 = XGBoost.train(params, trainMat, 1, watches, null, null);
Booster booster2 = XGBoost.train(trainMat, params, 1, watches, null, null);
}
}

View File

@ -49,7 +49,7 @@ public class CrossValidation {
//set additional eval_metrics
String[] metrics = null;
String[] evalHist = XGBoost.crossValidation(params, trainMat, round, nfold, metrics, null,
String[] evalHist = XGBoost.crossValidation(trainMat, params, round, nfold, metrics, null,
null);
}
}

View File

@ -163,6 +163,6 @@ public class CustomObjective {
//train a booster
System.out.println("begin to train the booster model");
Booster booster = XGBoost.train(params, trainMat, round, watches, obj, eval);
Booster booster = XGBoost.train(trainMat, params, round, watches, obj, eval);
}
}

View File

@ -56,6 +56,6 @@ public class ExternalMemory {
int round = 2;
//train a boost model
Booster booster = XGBoost.train(params, trainMat, round, watches, null, null);
Booster booster = XGBoost.train(trainMat, params, round, watches, null, null);
}
}

View File

@ -60,7 +60,7 @@ public class GeneralizedLinearModel {
//train a booster
int round = 4;
Booster booster = XGBoost.train(params, trainMat, round, watches, null, null);
Booster booster = XGBoost.train(trainMat, params, round, watches, null, null);
float[][] predicts = booster.predict(testMat);

View File

@ -51,7 +51,7 @@ public class PredictFirstNtree {
//train a booster
int round = 3;
Booster booster = XGBoost.train(params, trainMat, round, watches, null, null);
Booster booster = XGBoost.train(trainMat, params, round, watches, null, null);
//predict use 1 tree
float[][] predicts1 = booster.predict(testMat, false, 1);

View File

@ -49,7 +49,7 @@ public class PredictLeafIndices {
//train a booster
int round = 3;
Booster booster = XGBoost.train(params, trainMat, round, watches, null, null);
Booster booster = XGBoost.train(trainMat, params, round, watches, null, null);
//predict using first 2 tree
float[][] leafindex = booster.predictLeaf(testMat, 2);

View File

@ -43,7 +43,7 @@ class BasicWalkThrough {
val round = 2
// train a model
val booster = XGBoost.train(params.toMap, trainMax, round, watches.toMap)
val booster = XGBoost.train(trainMax, params.toMap, round, watches.toMap)
// predict
val predicts = booster.predict(testMax)
// save model to model path
@ -78,7 +78,7 @@ class BasicWalkThrough {
val watches2 = new mutable.HashMap[String, DMatrix]
watches2 += "train" -> trainMax2
watches2 += "test" -> testMax2
val booster3 = XGBoost.train(params.toMap, trainMax2, round, watches2.toMap, null, null)
val booster3 = XGBoost.train(trainMax2, params.toMap, round, watches2.toMap, null, null)
val predicts3 = booster3.predict(testMax2)
println(checkPredicts(predicts, predicts3))
}

View File

@ -39,7 +39,7 @@ class BoostFromPrediction {
val round = 2
// train a model
val booster = XGBoost.train(params.toMap, trainMat, round, watches.toMap)
val booster = XGBoost.train(trainMat, params.toMap, round, watches.toMap)
val trainPred = booster.predict(trainMat, true)
val testPred = booster.predict(testMat, true)
@ -48,6 +48,6 @@ class BoostFromPrediction {
testMat.setBaseMargin(testPred)
System.out.println("result of running from initial prediction")
val booster2 = XGBoost.train(params.toMap, trainMat, 1, watches.toMap, null, null)
val booster2 = XGBoost.train(trainMat, params.toMap, 1, watches.toMap, null, null)
}
}

View File

@ -41,6 +41,6 @@ class CrossValidation {
val metrics: Array[String] = null
val evalHist: Array[String] =
XGBoost.crossValidation(params.toMap, trainMat, round, nfold, metrics, null, null)
XGBoost.crossValidation(trainMat, params.toMap, round, nfold, metrics, null, null)
}
}

View File

@ -150,8 +150,8 @@ class CustomObjective {
val round = 2
// train a model
val booster = XGBoost.train(params.toMap, trainMat, round, watches.toMap)
XGBoost.train(params.toMap, trainMat, round, watches.toMap, new LogRegObj, new EvalError)
val booster = XGBoost.train(trainMat, params.toMap, round, watches.toMap)
XGBoost.train(trainMat, params.toMap, round, watches.toMap, new LogRegObj, new EvalError)
}
}

View File

@ -45,7 +45,7 @@ class ExternalMemory {
val round = 2
// train a model
val booster = XGBoost.train(params.toMap, trainMat, round, watches.toMap)
val booster = XGBoost.train(trainMat, params.toMap, round, watches.toMap)
val trainPred = booster.predict(trainMat, true)
val testPred = booster.predict(testMat, true)
@ -54,6 +54,6 @@ class ExternalMemory {
testMat.setBaseMargin(testPred)
System.out.println("result of running from initial prediction")
val booster2 = XGBoost.train(params.toMap, trainMat, 1, watches.toMap, null, null)
val booster2 = XGBoost.train(trainMat, params.toMap, 1, watches.toMap, null, null)
}
}

View File

@ -52,7 +52,7 @@ class GeneralizedLinearModel {
watches += "test" -> testMat
val round = 4
val booster = XGBoost.train(params.toMap, trainMat, 1, watches.toMap, null, null)
val booster = XGBoost.train(trainMat, params.toMap, 1, watches.toMap, null, null)
val predicts = booster.predict(testMat)
val eval = new CustomEval
println(s"error=${eval.eval(predicts, testMat)}")

View File

@ -38,7 +38,7 @@ class PredictFirstNTree {
val round = 3
// train a model
val booster = XGBoost.train(params.toMap, trainMat, round, watches.toMap)
val booster = XGBoost.train(trainMat, params.toMap, round, watches.toMap)
// predict use 1 tree
val predicts1 = booster.predict(testMat, false, 1)

View File

@ -39,7 +39,7 @@ class PredictLeafIndices {
watches += "test" -> testMat
val round = 3
val booster = XGBoost.train(params.toMap, trainMat, round, watches.toMap)
val booster = XGBoost.train(trainMat, params.toMap, round, watches.toMap)
// predict using first 2 tree
val leafIndex = booster.predictLeaf(testMat, 2)

View File

@ -34,7 +34,7 @@ object DistTrainWithFlink {
// number of iterations
val round = 2
// train the model
val model = XGBoost.train(paramMap, trainData, round)
val model = XGBoost.train(trainData, paramMap, round)
val predTest = model.predict(testData.map{x => x.vector})
model.saveModelAsHadoopFile("file:///path/to/xgboost.model")
}

View File

@ -16,29 +16,34 @@
package ml.dmlc.xgboost4j.scala.example.spark
import ml.dmlc.xgboost4j.scala.spark.XGBoost
import ml.dmlc.xgboost4j.scala.DMatrix
import ml.dmlc.xgboost4j.scala.spark.{DataUtils, XGBoost}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.util.MLUtils
object DistTrainWithSpark {
def main(args: Array[String]): Unit = {
if (args.length != 4) {
if (args.length != 5) {
println(
"usage: program num_of_rounds num_workers training_path model_path")
"usage: program num_of_rounds num_workers training_path test_path model_path")
sys.exit(1)
}
val sc = new SparkContext()
val inputTrainPath = args(2)
val outputModelPath = args(3)
val inputTestPath = args(3)
val outputModelPath = args(4)
// number of iterations
val numRound = args(0).toInt
val trainRDD = MLUtils.loadLibSVMFile(sc, inputTrainPath).repartition(args(1).toInt)
import DataUtils._
val trainRDD = MLUtils.loadLibSVMFile(sc, inputTrainPath)
val testSet = MLUtils.loadLibSVMFile(sc, inputTestPath).collect().iterator
// training parameters
val paramMap = List(
"eta" -> 0.1f,
"max_depth" -> 2,
"objective" -> "binary:logistic").toMap
val xgboostModel = XGBoost.train(trainRDD, paramMap, numRound)
val xgboostModel = XGBoost.train(trainRDD, paramMap, numRound, nWorkers = args(1).toInt)
xgboostModel.predict(new DMatrix(testSet))
// save model to HDFS path
xgboostModel.saveModelAsHadoopFile(outputModelPath)
}

View File

@ -56,7 +56,7 @@ object XGBoost {
val trainMat = new DMatrix(dataIter, null)
val watches = List("train" -> trainMat).toMap
val round = 2
val booster = XGBoostScala.train(paramMap, trainMat, round, watches, null, null)
val booster = XGBoostScala.train(trainMat, paramMap, round, watches, null, null)
Rabit.shutdown()
collector.collect(new XGBoostModel(booster))
}
@ -81,12 +81,13 @@ object XGBoost {
/**
* Train a xgboost model with link.
*
* @param params The parameters to XGBoost.
* @param dtrain The training data.
* @param params The parameters to XGBoost.
* @param round Number of rounds to train.
*/
def train(params: Map[String, Any],
def train(
dtrain: DataSet[LabeledVector],
params: Map[String, Any],
round: Int): XGBoostModel = {
val tracker = new RabitTracker(dtrain.getExecutionEnvironment.getParallelism)
if (tracker.start()) {

View File

@ -37,6 +37,15 @@ class XGBoostModel (booster: Booster) extends Serializable {
.create(new Path(modelPath)))
}
/**
* predict with the given DMatrix
* @param testSet the local test set represented as DMatrix
* @return prediction result
*/
def predict(testSet: DMatrix): Array[Array[Float]] = {
booster.predict(testSet, true, 0)
}
/**
* Predict given vector dataset.
*
@ -44,7 +53,7 @@ class XGBoostModel (booster: Booster) extends Serializable {
* @return The prediction result.
*/
def predict(data: DataSet[Vector]) : DataSet[Array[Float]] = {
val predictMap: Iterator[Vector] => TraversableOnce[Array[Float]] =
val predictMap: Iterator[Vector] => Traversable[Array[Float]] =
(it: Iterator[Vector]) => {
val mapper = (x: Vector) => {
val (index, value) = x.toSeq.unzip

View File

@ -56,9 +56,10 @@ object XGBoost extends Serializable {
trainingSamples =>
rabitEnv.put("DMLC_TASK_ID", TaskContext.getPartitionId().toString)
Rabit.init(rabitEnv.asJava)
val dMatrix = new DMatrix(new JDMatrix(trainingSamples, null))
val booster = SXGBoost.train(xgBoostConfMap, dMatrix, round,
watches = new mutable.HashMap[String, DMatrix]{put("train", dMatrix)}.toMap, obj, eval)
val trainingSet = new DMatrix(new JDMatrix(trainingSamples, null))
val booster = SXGBoost.train(trainingSet, xgBoostConfMap, round,
watches = new mutable.HashMap[String, DMatrix]{put("train", trainingSet)}.toMap,
obj, eval)
Rabit.shutdown()
Iterator(booster)
}.cache()

View File

@ -60,8 +60,8 @@ public class XGBoost {
/**
* Train a booster with given parameters.
*
* @param params Booster params.
* @param dtrain Data to be trained.
* @param params Booster params.
* @param round Number of boosting iterations.
* @param watches a group of items to be evaluated during training, this allows user to watch
* performance on the validation set.
@ -70,8 +70,10 @@ public class XGBoost {
* @return trained booster
* @throws XGBoostError native error
*/
public static Booster train(Map<String, Object> params,
DMatrix dtrain, int round,
public static Booster train(
DMatrix dtrain,
Map<String, Object> params,
int round,
Map<String, DMatrix> watches,
IObjective obj,
IEvaluation eval) throws XGBoostError {
@ -139,8 +141,8 @@ public class XGBoost {
/**
* Cross-validation with given parameters.
*
* @param params Booster params.
* @param data Data to be trained.
* @param params Booster params.
* @param round Number of boosting iterations.
* @param nfold Number of folds in CV.
* @param metrics Evaluation metrics to be watched in CV.
@ -150,8 +152,8 @@ public class XGBoost {
* @throws XGBoostError native error
*/
public static String[] crossValidation(
Map<String, Object> params,
DMatrix data,
Map<String, Object> params,
int round,
int nfold,
String[] metrics,

View File

@ -35,10 +35,10 @@ class DMatrix private[scala](private[scala] val jDMatrix: JDMatrix) {
* init DMatrix from Iterator of LabeledPoint
*
* @param dataIter An iterator of LabeledPoint
* @param cacheInfo Cache path information, used for external memory setting, can be null.
* @param cacheInfo Cache path information, used for external memory setting, null by default.
* @throws XGBoostError native error
*/
def this(dataIter: Iterator[LabeledPoint], cacheInfo: String) {
def this(dataIter: Iterator[LabeledPoint], cacheInfo: String = null) {
this(new JDMatrix(dataIter.asJava, cacheInfo))
}

View File

@ -28,8 +28,8 @@ object XGBoost {
/**
* Train a booster given parameters.
*
* @param params Parameters.
* @param dtrain Data to be trained.
* @param params Parameters.
* @param round Number of boosting iterations.
* @param watches a group of items to be evaluated during training, this allows user to watch
* performance on the validation set.
@ -39,8 +39,8 @@ object XGBoost {
*/
@throws(classOf[XGBoostError])
def train(
params: Map[String, Any],
dtrain: DMatrix,
params: Map[String, Any],
round: Int,
watches: Map[String, DMatrix] = Map[String, DMatrix](),
obj: ObjectiveTrait = null,
@ -49,10 +49,11 @@ object XGBoost {
val jWatches = watches.map{case (name, matrix) => (name, matrix.jDMatrix)}
val xgboostInJava = JXGBoost.train(
dtrain.jDMatrix,
params.map{
case (key: String, value) => (key, value.toString)
}.toMap[String, AnyRef].asJava,
dtrain.jDMatrix, round, jWatches.asJava,
round, jWatches.asJava,
obj, eval)
new Booster(xgboostInJava)
}
@ -60,8 +61,8 @@ object XGBoost {
/**
* Cross-validation with given parameters.
*
* @param params Booster params.
* @param data Data to be trained.
* @param params Booster params.
* @param round Number of boosting iterations.
* @param nfold Number of folds in CV.
* @param metrics Evaluation metrics to be watched in CV.
@ -71,17 +72,17 @@ object XGBoost {
*/
@throws(classOf[XGBoostError])
def crossValidation(
params: Map[String, Any],
data: DMatrix,
params: Map[String, Any],
round: Int,
nfold: Int = 5,
metrics: Array[String] = null,
obj: ObjectiveTrait = null,
eval: EvalTrait = null): Array[String] = {
JXGBoost.crossValidation(params.map{
case (key: String, value) => (key, value.toString)
}.toMap[String, AnyRef].asJava,
data.jDMatrix, round, nfold, metrics, obj, eval)
JXGBoost.crossValidation(
data.jDMatrix, params.map{ case (key: String, value) => (key, value.toString)}.
toMap[String, AnyRef].asJava,
round, nfold, metrics, obj, eval)
}
/**

View File

@ -94,7 +94,7 @@ public class BoosterImplTest {
int round = 5;
//train a boost model
return XGBoost.train(paramMap, trainMat, round, watches, null, null);
return XGBoost.train(trainMat, paramMap, round, watches, null, null);
}
@Test
@ -177,6 +177,6 @@ public class BoosterImplTest {
//do 5-fold cross validation
int round = 2;
int nfold = 5;
String[] evalHist = XGBoost.crossValidation(param, trainMat, round, nfold, null, null, null);
String[] evalHist = XGBoost.crossValidation(trainMat, param, round, nfold, null, null, null);
}
}

View File

@ -74,7 +74,7 @@ class ScalaBoosterImplSuite extends FunSuite {
val watches = List("train" -> trainMat, "test" -> testMat).toMap
val round = 2
XGBoost.train(paramMap, trainMat, round, watches, null, null)
XGBoost.train(trainMat, paramMap, round, watches, null, null)
}
test("basic operation of booster") {
@ -126,6 +126,6 @@ class ScalaBoosterImplSuite extends FunSuite {
"objective" -> "binary:logistic", "gamma" -> "1.0", "eval_metric" -> "error").toMap
val round = 2
val nfold = 5
XGBoost.crossValidation(params, trainMat, round, nfold, null, null, null)
XGBoost.crossValidation(trainMat, params, round, nfold, null, null, null)
}
}