small change in the wording of Otto R markdown

pommedeterresautee 2015-05-08 16:29:29 +02:00
parent fd983dfb97
commit e92d384a6a


@@ -42,13 +42,13 @@ Let's explore the dataset.
dim(train)
# Training content
-train[1:6,1:5, with =F]
+train[1:6, 1:5, with =F]
# Test dataset dimensions
dim(test)
# Test content
-test[1:6,1:5, with =F]
+test[1:6, 1:5, with =F]
```
> We only display the first 6 rows and the first 5 columns for convenience.
@@ -107,7 +107,7 @@ testMatrix <- test[,lapply(.SD,as.numeric)] %>% as.matrix
Model training
==============
-Before the learning we will use the cross validation to evaluate the our error rate.
+Before training, we will use cross validation to evaluate the error rate.
Basically **XGBoost** will divide the training data into `nfold` parts, then retain the first part and use it as test data. Then it will reintegrate the first part into the training dataset, retain the second part as the new test data, train again, and so on...
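As a sketch, such a cross validation call could look like the following (the `nfold` and `nrounds` values are illustrative assumptions, not tuned settings; `trainMatrix` and the label vector `y` are assumed to come from the data preparation step above):

```r
# Sketch of the cross validation step described above.
# Assumed objects: trainMatrix (numeric feature matrix) and y (class labels).
param <- list("objective"   = "multi:softprob",  # 9-class Otto problem
              "eval_metric" = "mlogloss",
              "num_class"   = 9)
cv.res <- xgb.cv(params = param, data = trainMatrix, label = y,
                 nfold = 5, nrounds = 50)
print(cv.res)
```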
@@ -144,21 +144,21 @@ Feature importance
So far, we have built a model made of **`r nround`** trees.
-To build a tree, the dataset is divided recursively several times. At the end of the process, you get groups of observations (here, these observations are properties regarding **Otto** products).
+To build a *tree*, the dataset is divided recursively `max.depth` times. At the end of the process, you get groups of observations (here, these observations are properties regarding **Otto** products).
Each division operation is called a *split*.
-Each group at each division level is called a branch and the deepest level is called a **leaf**.
+Each group at each division level is called a *branch* and the deepest level is called a *leaf*.
In the final model, these leaves are supposed to be as pure as possible for each tree, meaning in our case that each leaf should be made of one class of **Otto** product only (of course it is not completely true, but that's what we try to achieve in a minimum of splits).
**Not all splits are equally important**. Basically the first split of a tree will have more impact on the purity than, for instance, the deepest split. Intuitively, we understand that the first split does most of the work, and the following splits focus on smaller parts of the dataset which have been misclassified by the first split.
-In the same way, in Boosting we try to optimize the missclassification at each round (it is called the **loss**). So the first tree will do the big work and the following trees will focus on the remaining, on the parts not correctly learned by the previous trees.
+In the same way, in Boosting we try to optimize the misclassification at each round (it is called the *loss*). So the first tree will do most of the work and the following trees will focus on the remaining parts, those not correctly learned by the previous trees.
-The improvement brought by each split can be measured, it is the **gain**.
+The improvement brought by each split can be measured; it is the *gain*.
-Each split is done on one feature only at one value.
+Each split is done on one feature only at one specific value.
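As a rough illustration (not part of the original tutorial), the gain of every individual split can be inspected by dumping the trees into a table. Here `bst` and `names` are assumed to be the trained model and the feature names used in the chunks below:

```r
# Sketch: turn the trees of the trained model into a data.table.
# For split nodes the gain is reported in the Quality column
# (column naming may differ between xgboost versions).
treeTable <- xgb.model.dt.tree(feature_names = names, model = bst)
head(treeTable[Feature != "Leaf", .(Tree, Feature, Split, Quality)])
```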
Let's see what the model looks like.
@@ -189,7 +189,7 @@ importance_matrix <- xgb.importance(names, model = bst)
xgb.plot.importance(importance_matrix[1:10,])
```
-> To make it understandable we first extract the column names from the `Matrix`.
+> To make the graph understandable we first extract the column names from the `Matrix`.
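For reference, a minimal way to extract those column names could be the following sketch (assuming the `trainMatrix` built during data preparation; the tutorial's own chunk may differ slightly):

```r
# Sketch: recover the feature names from the numeric training matrix.
names <- dimnames(trainMatrix)[[2]]
```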
Interpretation
--------------
@@ -198,9 +198,9 @@ In the feature importance above, we can see the first 10 most important features
This function gives a color to each bar. Basically a K-means clustering is applied to group each feature by importance.
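As a rough illustration of that idea only, and not of the actual internals of `xgb.plot.importance`, the grouping could be reproduced on the `importance_matrix` computed above:

```r
# Sketch of the clustering idea: group features into a few importance
# clusters based on their Gain (3 clusters is an arbitrary choice).
set.seed(1)
clusters <- kmeans(importance_matrix$Gain, centers = 3)
table(clusters$cluster)
```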
-From here you can take several actions. For instance you can remove the less important feature (feature selection process), or go deeper in the interaction between the most important features and labels.
+From here you can take several actions. For instance, you can remove the least important features (feature selection process), or go deeper into the interaction between the most important features and the labels.
-Or you can just reason about why these features are so importat (in **Otto** challenge we can't go this way because there is not enough information).
+Or you can try to guess why these features are so important (in the **Otto** challenge we can't go this way because there is not enough information).
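As an example of the feature selection process mentioned above, a naive pass could look like the sketch below (the number of kept features is arbitrary, and `trainMatrix`, `y`, `param` and `nround` are assumed to come from the training step above):

```r
# Sketch of a naive feature selection step.
# importance_matrix is sorted by decreasing importance, so we keep the
# 50 best ranked features (arbitrary illustrative choice) and retrain.
topFeatures <- importance_matrix$Feature[1:50]
trainMatrixSmall <- trainMatrix[, colnames(trainMatrix) %in% topFeatures]
bstSmall <- xgboost(data = trainMatrixSmall, label = y,
                    params = param, nrounds = nround)
```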
Tree graph
----------
@@ -216,7 +216,7 @@ xgb.plot.tree(feature_names = names, model = bst, n_first_tree = 2)
We are just displaying the first two trees here.
On simple models, the first two trees may be enough. Here, it might not be the case. We can see from the size of the trees that the interaction between features is complicated.
-Besides, **XGBoost** generate `k` trees at each round for a `k`-classification problem. Therefore the two trees illustrated here are trying to classify data into different classes.
+Besides, **XGBoost** generates `K` trees at each round for a `K`-class classification problem. Therefore the two trees illustrated here are trying to classify data into different classes.
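As a small check of that statement (a sketch, not part of the original tutorial), each tree can be mapped back to the class it contributes to: with `num_class` classes, tree `i` (0-based) belongs to class `i %% num_class`.

```r
# Sketch: map each tree of the trained model bst to the Otto class it
# contributes to. With 9 classes, tree i (0-based) targets class i %% 9,
# so the two trees plotted above target classes 0 and 1.
treeTable <- xgb.model.dt.tree(feature_names = names, model = bst)
treeTable[, TargetClass := Tree %% 9]
unique(treeTable[Tree < 2, .(Tree, TargetClass)])
```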
Going deeper
============