fixed some typos (#1814)

Dr. Kashif Rasul 2016-11-25 22:34:57 +01:00 committed by Yuan (Terry) Tang
parent be2f28ec08
commit da2556f58a
14 changed files with 32 additions and 38 deletions

View File

@@ -18,6 +18,6 @@ Checkout [this tutorial](https://xgboost.readthedocs.org/en/latest/tutorials/aws
 Model Analysis
 --------------
-XGBoost is exchangable across all bindings and platforms.
+XGBoost is exchangeable across all bindings and platforms.
 This means you can use python or R to analyze the learnt model and do prediction.
 For example, you can use the [plot_model.ipynb](plot_model.ipynb) to visualize the learnt model.
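As the corrected passage notes, a learnt model can be analysed from python or R regardless of where it was trained. A minimal sketch in the Python binding (the model file name is illustrative, and plotting assumes matplotlib is installed):

```python
import xgboost as xgb
import matplotlib.pyplot as plt

# Load a model that was saved by any binding (file name illustrative).
bst = xgb.Booster()
bst.load_model("xgboost.model")

# Inspect which features the learnt model relies on most.
xgb.plot_importance(bst)
plt.show()
```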

View File

@@ -9,7 +9,7 @@ Guide for Kaggle Higgs Challenge
 This is the folder giving example of how to use XGBoost Python Module to run Kaggle Higgs competition
-This script will achieve about 3.600 AMS score in public leadboard. To get start, you need do following step:
+This script will achieve about 3.600 AMS score in public leaderboard. To get start, you need do following step:
 1. Compile the XGBoost python lib
 ```bash
@@ -29,4 +29,3 @@ speedtest.py compares xgboost's speed on this dataset with sklearn.GBM
 Using R module
 =====
 * Alternatively, you can run using R, higgs-train.R and higgs-pred.R.

View File

@@ -152,9 +152,9 @@ Each group at each division level is called a branch and the deepest level is ca
 In the final model, these *leafs* are supposed to be as pure as possible for each tree, meaning in our case that each *leaf* should be made of one class of **Otto** product only (of course it is not true, but that's what we try to achieve in a minimum of splits).
-**Not all *splits* are equally important**. Basically the first *split* of a tree will have more impact on the purity that, for instance, the deepest *split*. Intuitively, we understand that the first *split* makes most of the work, and the following *splits* focus on smaller parts of the dataset which have been missclassified by the first *tree*.
+**Not all *splits* are equally important**. Basically the first *split* of a tree will have more impact on the purity that, for instance, the deepest *split*. Intuitively, we understand that the first *split* makes most of the work, and the following *splits* focus on smaller parts of the dataset which have been misclassified by the first *tree*.
-In the same way, in Boosting we try to optimize the missclassification at each round (it is called the *loss*). So the first *tree* will do the big work and the following trees will focus on the remaining, on the parts not correctly learned by the previous *trees*.
+In the same way, in Boosting we try to optimize the misclassification at each round (it is called the *loss*). So the first *tree* will do the big work and the following trees will focus on the remaining, on the parts not correctly learned by the previous *trees*.
 The improvement brought by each *split* can be measured, it is the *gain*.
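For reference, the *gain* of a candidate split in XGBoost's regularized objective has a closed form: with $G_L, H_L$ and $G_R, H_R$ the sums of first- and second-order gradients of the loss over the instances sent to the left and right child, $\lambda$ the L2 regularization weight and $\gamma$ the per-leaf complexity cost,

$$
\mathrm{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma .
$$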
@@ -200,7 +200,7 @@ This function gives a color to each bar. These colors represent groups of featur
 From here you can take several actions. For instance you can remove the less important feature (feature selection process), or go deeper in the interaction between the most important features and labels.
-Or you can just reason about why these features are so importat (in **Otto** challenge we can't go this way because there is not enough information).
+Or you can just reason about why these features are so important (in **Otto** challenge we can't go this way because there is not enough information).
 Tree graph
 ----------
@@ -217,7 +217,7 @@ xgb.plot.tree(feature_names = names, model = bst, n_first_tree = 2)
 We are just displaying the first two trees here.
-On simple models the first two trees may be enough. Here, it might not be the case. We can see from the size of the trees that the intersaction between features is complicated.
+On simple models the first two trees may be enough. Here, it might not be the case. We can see from the size of the trees that the interaction between features is complicated.
 Besides, **XGBoost** generate `k` trees at each round for a `k`-classification problem. Therefore the two trees illustrated here are trying to classify data into different classes.
 Going deeper
@@ -226,6 +226,6 @@ Going deeper
 There are 4 documents you may also be interested in:
 * [xgboostPresentation.Rmd](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd): general presentation
-* [discoverYourData.Rmd](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/discoverYourData.Rmd): explaining feature analysus
+* [discoverYourData.Rmd](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/discoverYourData.Rmd): explaining feature analysis
 * [Feature Importance Analysis with XGBoost in Tax audit](http://fr.slideshare.net/MichaelBENESTY/feature-importance-analysis-with-xgboost-in-tax-audit): use case
 * [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/): very good book to have a good understanding of the model
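The vignette itself uses the R functions (`xgb.plot.tree` with `n_first_tree`); the same kind of inspection can be sketched in the Python binding as well (the model file name is an assumption, and the tree plot additionally requires graphviz):

```python
import xgboost as xgb
import matplotlib.pyplot as plt

# Assume a multi-class booster trained on the Otto data has been saved.
bst = xgb.Booster()
bst.load_model("otto.model")  # illustrative file name

# Rank features by the total gain contributed by the splits that use them.
xgb.plot_importance(bst, importance_type="gain")

# Draw the first of the k trees generated at the first boosting round.
xgb.plot_tree(bst, num_trees=0)
plt.show()
```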

View File

@@ -19,4 +19,3 @@ You can use the following command to run the example
 Get the data: ./wgetdata.sh
 Run the example: ./runexp.sh

View File

@@ -41,7 +41,7 @@ Most importantly, it pushes the limit of the computation resources we can use.
 How can I port the model to my own system
 -----------------------------------------
-The model and data format of XGBoost is exchangable,
+The model and data format of XGBoost is exchangeable,
 which means the model trained by one language can be loaded in another.
 This means you can train the model using R, while running prediction using
 Java or C++, which are more common in production systems.
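Because the saved model format is shared across bindings, a model trained in one language can simply be written to disk and re-opened from another. A minimal sketch in the Python binding (the data path and parameters are illustrative):

```python
import xgboost as xgb

# Train a small model from libsvm-style text data (path illustrative).
dtrain = xgb.DMatrix("train.libsvm")
bst = xgb.train({"objective": "binary:logistic", "max_depth": 2}, dtrain, num_boost_round=10)

# The saved file uses XGBoost's portable format, so the R, Java, or C++
# bindings can load it for prediction in a production system.
bst.save_model("model.bin")

# Reloading (shown here in Python; the same file works from other bindings).
bst2 = xgb.Booster()
bst2.load_model("model.bin")
```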

View File

@@ -36,7 +36,6 @@ bst <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, n
 nthread = 2, objective = "binary:logistic")
 # predict
 pred <- predict(bst, test$data)
 ```
 ## Julia

View File

@@ -138,7 +138,7 @@ make the-markdown-to-make.md
 - Add the generated figure to the ```dmlc/web-data``` repo.
 - If you already cloned the repo to doc, this means a ```git add```
 - Create PR for both the markdown and ```dmlc/web-data```
-- You can also build the document locally by typing the followig command at ```doc```
+- You can also build the document locally by typing the following command at ```doc```
 ```bash
 make html
 ```

View File

@@ -6,7 +6,7 @@ This page contains guidelines to use and develop mxnets.
 - [How to Install XGBoost](../build.md)
 ## Use XGBoost in Specific Ways
-- [Parameter tunning guide](param_tuning.md)
+- [Parameter tuning guide](param_tuning.md)
 - [Use out of core computation for large dataset](external_memory.md)
 ## Develop and Hack XGBoost

View File

@@ -12,8 +12,7 @@ train.txt
 1 0:0.01 1:0.3
 0 0:0.2 1:0.3
 ```
-Each line represent a single instance, and in the first line '1' is the instance label,'101' and '102' are feature indices, '1.2' and '0.03' are feature values. In the binary classification case, '1' is used to indicate positive samples, and '0' is used to indicate negative samples. We also support probability values in [0,1] as label, to indicate the probability of the instanc
-e being positive.
+Each line represent a single instance, and in the first line '1' is the instance label,'101' and '102' are feature indices, '1.2' and '0.03' are feature values. In the binary classification case, '1' is used to indicate positive samples, and '0' is used to indicate negative samples. We also support probability values in [0,1] as label, to indicate the probability of the instance being positive.
 Additional Information
 ----------------------
@@ -54,4 +53,4 @@ train.txt.base_margin
 1.0
 3.4
 ```
-XGBoost will take these values as intial margin prediction and boost from that. An important note about base_margin is that it should be margin prediction before transformation, so if you are doing logistic loss, you will need to put in value before logistic transformation. If you are using XGBoost predictor, use pred_margin=1 to output margin values.
+XGBoost will take these values as initial margin prediction and boost from that. An important note about base_margin is that it should be margin prediction before transformation, so if you are doing logistic loss, you will need to put in value before logistic transformation. If you are using XGBoost predictor, use pred_margin=1 to output margin values.
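The same base_margin workflow can be sketched in the Python binding (file names taken from the examples above; the CLI's `pred_margin=1` corresponds to `output_margin=True` in Python):

```python
import numpy as np
import xgboost as xgb

# Load the libsvm-style text data shown above.
dtrain = xgb.DMatrix("train.txt")

# Margin predictions from a previous model, before the logistic transformation.
prev_margin = np.loadtxt("train.txt.base_margin")
dtrain.set_base_margin(prev_margin)

# Boosting now starts from the supplied margins instead of from zero.
bst = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=10)

# Raw margin scores, i.e. values before the logistic transformation.
margin_scores = bst.predict(dtrain, output_margin=True)
```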

View File

@@ -32,7 +32,7 @@ This command will publish the xgboost binaries, the compiled java classes as wel
-After integrating with Dataframe/Dataset APIs of Spark 2.0, XGBoost4J-Spark only supports compile with Spark 2.x. You can build XGBoost4J-Spark as a component of XGBoost4J by running `mvn package`, and you can specify the version of spark with `mvn -Dspark.version=2.0.0 package`. (To continue working with Spark 1.x, the users are supposed to update pom.xml by modifying the properties like `spark.version`, `scala.version`, and `scala.binary.version`. Users also need to change the implemention by replacing SparkSession with SQLContext and the type of API parameters from Dataset[_] to Dataframe)
+After integrating with Dataframe/Dataset APIs of Spark 2.0, XGBoost4J-Spark only supports compile with Spark 2.x. You can build XGBoost4J-Spark as a component of XGBoost4J by running `mvn package`, and you can specify the version of spark with `mvn -Dspark.version=2.0.0 package`. (To continue working with Spark 1.x, the users are supposed to update pom.xml by modifying the properties like `spark.version`, `scala.version`, and `scala.binary.version`. Users also need to change the implementation by replacing SparkSession with SQLContext and the type of API parameters from Dataset[_] to Dataframe)
 Contents
 --------

View File

@@ -26,7 +26,7 @@ They are also often [much more efficient](http://arxiv.org/abs/1603.02754).
 The gap between the implementation fundamentals of the general data processing frameworks and the more specific machine learning libraries/systems prohibits the smooth connection between these two types of systems, thus brings unnecessary inconvenience to the end user. The common workflow to the user is to utilize the systems like Spark/Flink to preprocess/clean data, pass the results to machine learning systems like [XGBoost](https://github.com/dmlc/xgboost)/[MxNet](https://github.com/dmlc/mxnet)) via the file systems and then conduct the following machine learning phase. This process jumping across two types of systems creates certain inconvenience for the users and brings additional overhead to the operators of the infrastructure.
-We want best of both worlds, so we can use the data processing frameworks like Spark and Flink toghether with
+We want best of both worlds, so we can use the data processing frameworks like Spark and Flink together with
 the best distributed machine learning solutions.
 To resolve the situation, we introduce the new-brewed [XGBoost4J](https://github.com/dmlc/xgboost/tree/master/jvm-packages),
 <b>XGBoost</b> for <b>J</b>VM Platform. We aim to provide the clean Java/Scala APIs and the integration with the most popular data processing systems developed in JVM-based languages.

View File

@@ -49,7 +49,7 @@ import org.apache.spark.ml.feature.StringIndexer
 // load sales records saved in json files
 val salesDF = spark.read.json("sales.json")
-// transfrom the string-represented storeType feature to numeric storeTypeIndex
+// transform the string-represented storeType feature to numeric storeTypeIndex
 val indexer = new StringIndexer()
 .setInputCol("storeType")
 .setOutputCol("storeTypeIndex")
@@ -71,7 +71,7 @@ import org.apache.spark.ml.feature.StringIndexer
 // load sales records saved in json files
 val salesDF = spark.read.json("sales.json")
-// transfrom the string-represented storeType feature to numeric storeTypeIndex
+// transform the string-represented storeType feature to numeric storeTypeIndex
 val indexer = new StringIndexer()
 .setInputCol("storeType")
 .setOutputCol("storeTypeIndex")
@@ -137,5 +137,3 @@ If you are interested in knowing more about XGBoost, you can find rich resources
 - [Tutorials for the R package](xgboost.readthedocs.org/en/latest/R-package/index.html)
 - [Introduction of the Parameters](http://xgboost.readthedocs.org/en/latest/parameter.html)
 - [Awesome XGBoost, a curated list of examples, tutorials, blogs about XGBoost usecases](https://github.com/dmlc/xgboost/tree/master/demo)

View File

@@ -49,7 +49,7 @@ Now we can open the browser, and type(replace the DNS with the master DNS)
 ```
 ec2-xx-xx-xx.us-west-2.compute.amazonaws.com:8088
 ```
-This will show the job tracker of the YARN cluster. Note that we may wait a few minutes before the master finishes bootstraping and starts the
+This will show the job tracker of the YARN cluster. Note that we may wait a few minutes before the master finishes bootstrapping and starts the
 job tracker.
 After master machine gets up, we can freely add more slave machines to the cluster.
@@ -158,7 +158,7 @@ Application application_1456461717456_0015 finished with state FINISHED at 14564
 Analyze the Model
 -----------------
 After the model is trained, we can analyse the learnt model and use it for future prediction task.
-XGBoost is a portable framework, the model in all platforms are ***exchangable***.
+XGBoost is a portable framework, the model in all platforms are ***exchangeable***.
 This means we can load the trained model in python/R/Julia and take benefit of data science pipelines
 in these languages to do model analysis and prediction.