fixed some typos in demos comments
This commit is contained in:
parent 752cf4c95d
commit b3bffcef34
@@ -7,7 +7,7 @@ if (!require(vcd)) {
 }
 # According to its documentation, Xgboost works only on numbers.
 # Sometimes the dataset we have to work on have categorical data.
-# A categorical variable is one which have a fixed number of values. By exemple, if for each observation a variable called "Colour" can have only "red", "blue" or "green" as value, it is a categorical variable.
+# A categorical variable is one which have a fixed number of values. By example, if for each observation a variable called "Colour" can have only "red", "blue" or "green" as value, it is a categorical variable.
 #
 # In R, categorical variable is called Factor.
 # Type ?factor in console for more information.
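For readers skimming the diff, the comments touched above are about R factors and the need to one-hot encode them before Xgboost can consume the data. A minimal sketch of that idea, using a made-up data frame df_toy rather than the demo's own data, and sparse.model.matrix() from the Matrix package:

library(Matrix)

# Illustrative toy data, not the dataset used by the demo.
df_toy <- data.frame(
  Colour = c("red", "blue", "green", "red"),
  Y      = c(1, 0, 1, 0)
)

# In R a categorical variable is stored as a factor; see ?factor.
df_toy$Colour <- as.factor(df_toy$Colour)

# One-hot encode the factor into a sparse numeric matrix,
# since Xgboost only accepts numbers.
sparse_toy <- sparse.model.matrix(Y ~ . - 1, data = df_toy)
print(sparse_toy@Dimnames[[2]])  # generated column names, one per factor level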
@@ -74,11 +74,11 @@ importance <- xgb.importance(sparse_matrix@Dimnames[[2]], 'xgb.model.dump')
 print(importance)
 # According to the matrix below, the most important feature in this dataset to predict if the treatment will work is the Age. The second most important feature is having received a placebo or not. The sex is third. Then we see our generated features (AgeDiscret). We can see that their contribution is very low (Gain column).
 
-# Does these results make sense?
+# Does these result make sense?
 # Let's check some Chi2 between each of these features and the outcome.
 
 print(chisq.test(df$Age, df$Y))
-# Pearson correlation between Age and illness disapearing is 35
+# Pearson correlation between Age and illness disappearing is 35
 
 print(chisq.test(df$AgeDiscret, df$Y))
 # Our first simplification of Age gives a Pearson correlation of 8.
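The chi-squared checks edited above compare each feature, raw or discretised, against the outcome. A standalone sketch of the same kind of check, again on a made-up df_toy standing in for the demo's df, with a coarse bucketing in the spirit of its AgeDiscret feature:

# Illustrative only: tiny fabricated vectors, not the demo's data.
df_toy <- data.frame(
  Age = c(23, 35, 41, 52, 60, 29, 48, 33),
  Y   = c(0, 1, 1, 1, 1, 0, 1, 0)
)

# Bucket Age into decades, similar in spirit to the demo's AgeDiscret.
df_toy$AgeDiscret <- as.factor(round(df_toy$Age / 10))

# On such a tiny sample chisq.test() will warn that the approximation
# may be inaccurate; the demo runs the same calls on its full dataset.
print(chisq.test(df_toy$Age, df_toy$Y))         # raw feature vs outcome
print(chisq.test(df_toy$AgeDiscret, df_toy$Y))  # discretised feature vs outcome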
@@ -86,6 +86,6 @@ print(chisq.test(df$AgeDiscret, df$Y))
 print(chisq.test(df$AgeCat, df$Y))
 # The perfectly random split I did between young and old at 30 years old have a low correlation of 2. It's a result we may expect as may be in my mind > 30 years is being old (I am 32 and starting feeling old, this may explain that), but for the illness we are studying, the age to be vulnerable is not the same. Don't let your "gut" lower the quality of your model. In "data science", there is science :-)
 
-# As you can see, in general destroying information by simplying it won't improve your model. Chi2 just demonstrates that. But in more complex cases, creating a new feature based on existing one which makes link with the outcome more obvious may help the algorithm and improve the model. The case studied here is not enough complex to show that. Check Kaggle forum for some challenging datasets.
+# As you can see, in general destroying information by simplifying it won't improve your model. Chi2 just demonstrates that. But in more complex cases, creating a new feature based on existing one which makes link with the outcome more obvious may help the algorithm and improve the model. The case studied here is not enough complex to show that. Check Kaggle forum for some challenging datasets.
 # However it's almost always worse when you add some arbitrary rules.
 # Moreover, you can notice that even if we have added some not useful new features highly correlated with other features, the boosting tree algorithm have been able to choose the best one, which in this case is the Age. Linear model may not be that strong in these scenario.