vignette text

pommedeterresautee 2015-02-12 22:44:57 +01:00
parent 276b68b984
commit ae9f7e9307
2 changed files with 17 additions and 10 deletions

@@ -24,7 +24,7 @@ body{
 /* color: white; */
 line-height: 1;
-max-width: 960px;
+max-width: 800px;
 padding: 20px;
 font-size: 17px;
 }
@@ -131,6 +131,8 @@ code {
 border-radius: 4px;
 padding: 5px;
 display: inline-block;
+max-width: 800px;
+white-space: pre-wrap;
 }
 code.r, code.cpp {
@@ -151,7 +153,7 @@ blockquote {
 border-left:.5em solid #606AAA;
 background: #F8F8F8;
 padding-left: 1em;
-margin-left:25px;
+margin-left:10px;
 max-width: 500px;
 }
@@ -162,14 +164,14 @@ blockquote cite {
 }
 blockquote cite:before {
-content: '\2014 \00A0';
+/* content: '\2014 \00A0'; */
 }
 blockquote p {
 color: #666;
 }
 hr {
 /* width: 540px; */
 text-align: left;
 margin: 0 auto 0 0;
 color: #999;

@@ -89,11 +89,15 @@ data(agaricus.train, package='xgboost')
 data(agaricus.test, package='xgboost')
 train <- agaricus.train
 test <- agaricus.test
-# Each variable is a S3 object containing both label and data.
 ```
 > In the real world, it would be up to you to make this division between `train` and `test` data. How you should do it is beyond the scope of this article; however, the `caret` package may [help](http://topepo.github.io/caret/splitting.html).
+Each variable is a `list` containing both label and data.
+```{r dataList, message=F, warning=F}
+str(train)
+```
 Let's discover the dimensionality of our datasets.
 ```{r dataSize, message=F, warning=F}
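The blockquote above defers the train/test split to `caret`; here is a minimal sketch of such a split, assuming a hypothetical data frame `df` with an outcome column `label` (neither belongs to the vignette):

```r
# Hypothetical train/test split with caret; `df` and `label` are stand-ins.
library(caret)

set.seed(42)                                       # make the split reproducible
inTrain  <- createDataPartition(df$label, p = 0.75, list = FALSE)
training <- df[inTrain, ]                          # 75% of the rows
testing  <- df[-inTrain, ]                         # the remaining 25%
```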
@@ -101,7 +105,7 @@ dim(train$data)
 dim(test$data)
 ```
-> Clearly, we have here a small dataset, however **Xgboost** can manage huge ones very efficiently.
+Clearly, we have here a small dataset, however **Xgboost** can manage huge ones very efficiently.
 The loaded `data` are stored in a `dgCMatrix`, which is a *sparse* matrix type, and `label` is a `numeric` vector in `{0,1}`.
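To make the `dgCMatrix` claim above concrete, a small sketch (not part of the vignette) that inspects the storage class and sparsity of the training data:

```r
# Inspect how the agaricus data is stored (assumes the datasets are loaded).
library(Matrix)   # provides the dgCMatrix class and nnzero()

class(train$data)                            # "dgCMatrix", a sparse matrix
nnzero(train$data) / prod(dim(train$data))   # fraction of non-zero cells
unique(train$label)                          # labels are 0 or 1
```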
@@ -125,7 +129,7 @@ In *sparse* matrix, cells which contains `0` are not encoded. Therefore, in a da
 bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic")
 ```
-> To reach the value of a `S3` object field we use the `$` character.
+> To reach the value of a variable in a `list`, use the `$` character followed by its name.
 Alternatively, you can put your dataset in a *dense* matrix, i.e. a basic *R* matrix.
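As a companion to the *dense* alternative mentioned above, a sketch of the same training call on a plain R matrix (the `bstDense` name is illustrative):

```r
# Same call as bstSparse, but with the sparse matrix converted to a dense one.
bstDense <- xgboost(data = as.matrix(train$data), label = train$label,
                    max.depth = 2, eta = 1, nround = 2,
                    objective = "binary:logistic")
```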
@@ -193,8 +197,9 @@ Here, we have just computed a simple metric: the average error:
 * `probabilityVectorPreviouslyComputed != test$label` computes the vector of errors between the true data and the computed probabilities;
 * `mean(vectorOfErrors)` computes the average error itself.
-> The most important thing to remember is that to do a classification, you basically just do a regression and then apply a threshold.
-> Multiclass classification works in a very similar way.
+The most important thing to remember is that **to do a classification, you basically just do a regression and then apply a threshold**.
+Multiclass classification works in a very similar way.
 This metric is **`r round(err, 2)`** and is pretty low: our yummy mushroom model works well!
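A sketch of the regression-then-threshold idea stressed above (the variable names here are illustrative, not the vignette's):

```r
# Predict probabilities, turn them into 0/1 classes with a 0.5 threshold,
# then measure the average error against the true labels.
pred       <- predict(bst, test$data)          # regression output in [0, 1]
prediction <- as.numeric(pred > 0.5)           # apply the threshold
err        <- mean(prediction != test$label)   # average classification error
```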
@@ -282,7 +287,7 @@ watchlist <- list(train=dtrain, test=dtest)
 bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nround=2, watchlist=watchlist, objective = "binary:logistic")
 ```
-> **Xgboost** has computed at each round the same average error metric as seen above (we set `nround` to 2, which is why there are two lines of metrics here). Obviously, the `train-error` number is related to the training dataset (the one the algorithm learns from) and the `test-error` number to the test dataset.
+**Xgboost** has computed at each round the same average error metric as seen above (we set `nround` to 2, which is why there are two lines of metrics here). Obviously, the `train-error` number is related to the training dataset (the one the algorithm learns from) and the `test-error` number to the test dataset.
 Both training and test error related metrics are very similar, and in some way, it makes sense: what we have learned from the training dataset matches the observations from the test dataset.
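For context, the `dtrain` and `dtest` objects that the watchlist refers to are typically built with `xgb.DMatrix`; a minimal sketch:

```r
# Build the xgb.DMatrix objects named in the watchlist above.
library(xgboost)
dtrain    <- xgb.DMatrix(data = train$data, label = train$label)
dtest     <- xgb.DMatrix(data = test$data,  label = test$label)
watchlist <- list(train = dtrain, test = dtest)   # reported as train-error / test-error
```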