fixed some typos (#1814)

committed by Yuan (Terry) Tang
parent be2f28ec08
commit da2556f58a
@@ -18,6 +18,6 @@ Checkout [this tutorial](https://xgboost.readthedocs.org/en/latest/tutorials/aws

 Model Analysis
 --------------
-XGBoost is exchangable across all bindings and platforms.
+XGBoost is exchangeable across all bindings and platforms.
 This means you can use python or R to analyze the learnt model and do prediction.
 For example, you can use the [plot_model.ipynb](plot_model.ipynb) to visualize the learnt model.
@@ -1,7 +1,7 @@
 Highlights
 =====
 Higgs challenge ends recently, xgboost is being used by many users. This list highlights the xgboost solutions of players
 * Blogpost by phunther: [Winning solution of Kaggle Higgs competition: what a single model can do](http://no2147483647.wordpress.com/2014/09/17/winning-solution-of-kaggle-higgs-competition-what-a-single-model-can-do/)
 * The solution by Tianqi Chen and Tong He [Link](https://github.com/hetong007/higgsml)

 Guide for Kaggle Higgs Challenge
@@ -9,7 +9,7 @@ Guide for Kaggle Higgs Challenge

 This is the folder giving example of how to use XGBoost Python Module to run Kaggle Higgs competition

-This script will achieve about 3.600 AMS score in public leadboard. To get start, you need do following step:
+This script will achieve about 3.600 AMS score in public leaderboard. To get start, you need do following step:

 1. Compile the XGBoost python lib
 ```bash
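For reference, the AMS (approximate median significance) score mentioned above can be computed directly; a small sketch (the function name and the example numbers are ours, the formula with the b_r = 10 regularisation term is the one the challenge used for evaluation):

```python
import math

def ams(s, b, b_r=10.0):
    """Approximate median significance, as used by the Higgs challenge.

    s   -- weighted sum of selected true signal events
    b   -- weighted sum of selected background events
    b_r -- constant regularisation term (10 in the challenge)
    """
    return math.sqrt(2.0 * ((s + b + b_r) * math.log(1.0 + s / (b + b_r)) - s))

# More selected signal at the same background raises the score.
print(ams(10.0, 100.0))  # roughly 0.94
print(ams(20.0, 100.0))
```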
@@ -28,5 +28,4 @@ speedtest.py compares xgboost's speed on this dataset with sklearn.GBM

 Using R module
 =====
-* Alternatively, you can run using R, higgs-train.R and higgs-pred.R.
 * Alternatively, you can run using R, higgs-train.R and higgs-pred.R.
@@ -152,9 +152,9 @@ Each group at each division level is called a branch and the deepest level is ca

 In the final model, these *leafs* are supposed to be as pure as possible for each tree, meaning in our case that each *leaf* should be made of one class of **Otto** product only (of course it is not true, but that's what we try to achieve in a minimum of splits).

-**Not all *splits* are equally important**. Basically the first *split* of a tree will have more impact on the purity that, for instance, the deepest *split*. Intuitively, we understand that the first *split* makes most of the work, and the following *splits* focus on smaller parts of the dataset which have been missclassified by the first *tree*.
+**Not all *splits* are equally important**. Basically the first *split* of a tree will have more impact on the purity that, for instance, the deepest *split*. Intuitively, we understand that the first *split* makes most of the work, and the following *splits* focus on smaller parts of the dataset which have been misclassified by the first *tree*.

-In the same way, in Boosting we try to optimize the missclassification at each round (it is called the *loss*). So the first *tree* will do the big work and the following trees will focus on the remaining, on the parts not correctly learned by the previous *trees*.
+In the same way, in Boosting we try to optimize the misclassification at each round (it is called the *loss*). So the first *tree* will do the big work and the following trees will focus on the remaining, on the parts not correctly learned by the previous *trees*.

 The improvement brought by each *split* can be measured, it is the *gain*.
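The idea that a split is measured by its purity improvement can be sketched with a toy entropy-based gain computation (the function names and data are ours; XGBoost's actual gain is derived from its regularised loss rather than raw entropy):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_gain(parent, left, right):
    """Purity improvement of splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# A perfect split of a 50/50 parent removes all impurity: gain 1.0 bit.
print(split_gain([0, 0, 1, 1], [0, 0], [1, 1]))  # 1.0
# A useless split leaves both children as mixed as the parent: gain 0.0.
print(split_gain([0, 0, 1, 1], [0, 1], [0, 1]))  # 0.0
```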
@@ -200,7 +200,7 @@ This function gives a color to each bar. These colors represent groups of featur

 From here you can take several actions. For instance you can remove the less important feature (feature selection process), or go deeper in the interaction between the most important features and labels.

-Or you can just reason about why these features are so importat (in **Otto** challenge we can't go this way because there is not enough information).
+Or you can just reason about why these features are so important (in **Otto** challenge we can't go this way because there is not enough information).

 Tree graph
 ----------
@@ -217,7 +217,7 @@ xgb.plot.tree(feature_names = names, model = bst, n_first_tree = 2)

 We are just displaying the first two trees here.

-On simple models the first two trees may be enough. Here, it might not be the case. We can see from the size of the trees that the intersaction between features is complicated.
+On simple models the first two trees may be enough. Here, it might not be the case. We can see from the size of the trees that the interaction between features is complicated.
 Besides, **XGBoost** generate `k` trees at each round for a `k`-classification problem. Therefore the two trees illustrated here are trying to classify data into different classes.

 Going deeper
@@ -226,6 +226,6 @@ Going deeper

 There are 4 documents you may also be interested in:

 * [xgboostPresentation.Rmd](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd): general presentation
-* [discoverYourData.Rmd](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/discoverYourData.Rmd): explaining feature analysus
+* [discoverYourData.Rmd](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/discoverYourData.Rmd): explaining feature analysis
 * [Feature Importance Analysis with XGBoost in Tax audit](http://fr.slideshare.net/MichaelBENESTY/feature-importance-analysis-with-xgboost-in-tax-audit): use case
 * [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/): very good book to have a good understanding of the model
@@ -1,22 +1,21 @@
 Learning to rank
 ====
 XGBoost supports accomplishing ranking tasks. In ranking scenario, data are often grouped and we need the [group information file](../../doc/input_format.md#group-input-format) to specify ranking tasks. The model used in XGBoost for ranking is the LambdaRank, this function is not yet completed. Currently, we provide pairwise rank.

 ### Parameters
-The configuration setting is similar to the regression and binary classification setting,except user need to specify the objectives:
+The configuration setting is similar to the regression and binary classification setting, except user need to specify the objectives:

 ```
 ...
 objective="rank:pairwise"
 ...
 ```
 For more usage details please refer to the [binary classification demo](../binary_classification),

 Instructions
 ====
 The dataset for ranking demo is from LETOR04 MQ2008 fold1,
 You can use the following command to run the example

 Get the data: ./wgetdata.sh
 Run the example: ./runexp.sh