library(dplyr)
library(readr)
library(ggplot2)
library(caret)
library(tibble)
library(purrr)
library(corrplot)
library(DescTools)
library(Information)
library(smbinning)
library(lmtest)
library(MASS)
library(pROC)
library(kernlab)
library(viridis)
library(janitor)
job_change_train <- read_csv("job_change_train.csv")
job_change_test <- read_csv("job_change_test.csv")
Our dataset does not have any missing values. We start by setting appropriate variable types in both the training and the test sample; a code sketch follows the type table below.
## Variable Type
## id id numeric
## gender gender factor
## age age numeric
## education education factor
## field_of_studies field_of_studies factor
## is_studying is_studying factor
## county county factor
## relative_wage relative_wage numeric
## years_since_job_change years_since_job_change factor
## years_of_experience years_of_experience factor
## hours_of_training hours_of_training numeric
## is_certified is_certified factor
## size_of_company size_of_company factor
## type_of_company type_of_company factor
## willing_to_change_job willing_to_change_job factor
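The type-setting code is not echoed in the report; a minimal sketch consistent with the table above (the vector name fct_cols is ours):

fct_cols <- c("gender", "education", "field_of_studies", "is_studying",
              "county", "years_since_job_change", "years_of_experience",
              "is_certified", "size_of_company", "type_of_company",
              "willing_to_change_job")

# apply the same types to both samples; the remaining columns stay numeric
job_change_train <- job_change_train %>% mutate(across(all_of(fct_cols), as.factor))
job_change_test  <- job_change_test  %>% mutate(across(all_of(fct_cols), as.factor))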
First, we remove the “id” variable from both datasets, as it has no predictive value.
We calculate correlations between numeric variables.
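A sketch of both steps; per the interpretation below, the reported number is the correlation between relative_wage and age:

# dplyr:: prefix because MASS (loaded above) masks dplyr's select()
job_change_train <- job_change_train %>% dplyr::select(-id)
job_change_test  <- job_change_test  %>% dplyr::select(-id)

cor(job_change_train$relative_wage, job_change_train$age)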
## [1] 0.3203956
Relative wage is correlated with age, but the correlation is rather weak.
We also check the relationship between the numeric variables and the target variable using one-way ANOVA:
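A sketch of this check (object names are ours):

num_vars <- c("relative_wage", "age", "hours_of_training")

# one-way ANOVA of each numeric predictor against the binary target
anova_results <- map_dfr(num_vars, function(v) {
  s <- summary(aov(reformulate("willing_to_change_job", response = v),
                   data = job_change_train))[[1]]
  tibble(variable = v, F_stat = s$`F value`[1], p_value = s$`Pr(>F)`[1])
})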
## F_stat p_value
## relative_wage 1676.792861 0.000000e+00
## age 337.065279 2.660737e-74
## hours_of_training 3.524188 6.050284e-02
The F statistic for hours_of_training is very small and its p-value exceeds 0.05, so at the 5% significance level this variable is not useful for modelling the target variable. Since it does not help distinguish those willing to change jobs from those who are not, we remove it from both datasets.
We further investigate the relationship between the categorical variables and the target variable using the chi-squared test of independence:
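A sketch of the test loop (vector names are ours):

cat_vars <- c("gender", "education", "field_of_studies", "is_studying",
              "county", "years_since_job_change", "years_of_experience",
              "is_certified", "size_of_company", "type_of_company")

p_values <- sapply(cat_vars, function(v)
  chisq.test(table(job_change_train[[v]],
                   job_change_train$willing_to_change_job))$p.value)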
## p_values
## gender 4.682101e-17
## education 1.584596e-22
## field_of_studies 1.114671e-08
## is_studying 1.732990e-56
## county 0.000000e+00
## years_since_job_change 3.733758e-22
## years_of_experience 2.185853e-83
## is_certified 1.023978e-47
## size_of_company 3.669965e-154
## type_of_company 1.219730e-131
All of these variables have p-values far below 0.05, giving very strong evidence of a relationship with willingness to change jobs. For this reason, all of them remain in our dataset.
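The next table checks for zero and near-zero variance; it matches the saveMetrics output of caret's nearZeroVar, presumably from a call like:

nearZeroVar(job_change_train, saveMetrics = TRUE)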
## freqRatio percentUnique zeroVar nzv
## gender 2.873777 0.03218798 FALSE FALSE
## age 1.078780 0.24945683 FALSE FALSE
## education 2.659131 0.04828197 FALSE FALSE
## field_of_studies 5.111717 0.04828197 FALSE FALSE
## is_studying 3.786796 0.03218798 FALSE FALSE
## county 1.595144 0.98978032 FALSE FALSE
## relative_wage 1.909656 0.74837048 FALSE FALSE
## years_since_job_change 2.414419 0.05632896 FALSE FALSE
## years_of_experience 2.352298 0.18508087 FALSE FALSE
## is_certified 2.574058 0.01609399 FALSE FALSE
## size_of_company 1.917331 0.07242295 FALSE FALSE
## type_of_company 1.594994 0.05632896 FALSE FALSE
## willing_to_change_job 3.025591 0.01609399 FALSE FALSE
## willing_to_change_job_binary 3.025591 0.01609399 FALSE FALSE
This additional test for zero and near-zero variance of our variables came out negative for all of them.
We further decided to look closer into the relationship between the predictors and the target variable.
There is a clear difference in the distribution of wages between those willing and those not willing to change jobs. We therefore bin this variable, requiring a minimum of 5% of observations per bin.
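A sketch of the binning call with the smbinning package loaded above. smbinning() expects a plain data.frame and a 0/1 integer target (the binary column name appears in the variance table earlier); the "Yes" level is an assumption:

bin_df <- as.data.frame(job_change_train)
bin_df$willing_to_change_job_binary <-
  as.integer(bin_df$willing_to_change_job == "Yes")   # assumed positive level

wage_bin <- smbinning(bin_df, y = "willing_to_change_job_binary",
                      x = "relative_wage", p = 0.05)  # min. 5% of observations per bin
wage_bin$ivtable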
## Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec
## 1 <= 110.45 2260 1321 939 2260 1321 939 0.1819
## 2 <= 139.65 1558 415 1143 3818 1736 2082 0.1254
## 3 > 139.65 8609 1351 7258 12427 3087 9340 0.6928
## 4 Missing 0 0 0 12427 3087 9340 0.0000
## 5 Total 12427 3087 9340 NA NA NA 1.0000
## GoodRate BadRate Odds LnOdds WoE IV
## 1 0.5845 0.4155 1.4068 0.3413 1.4484 0.4742
## 2 0.2664 0.7336 0.3631 -1.0131 0.0940 0.0011
## 3 0.1569 0.8431 0.1861 -1.6813 -0.5742 0.1949
## 4 NaN NaN NaN NaN NaN NaN
## 5 0.2484 0.7516 0.3305 -1.1071 0.0000 0.6702
Cut points for relative_wage:
## [1] 110.45 139.65
Values of the relative_wage variable were replaced with the WoE of each interval:
##
## -0.5742 0.094 1.4484
## 8609 1558 2260
We perform the same analysis for age.
Although the difference is less pronounced, we can also see a gap in the distribution of age between those willing and those not willing to change jobs. For this reason we also binned this variable, using the same method as for relative_wage.
## Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate
## 1 <= 23 1050 406 644 1050 406 644 0.0845 0.3867
## 2 <= 27 2978 977 2001 4028 1383 2645 0.2396 0.3281
## 3 <= 32 3184 823 2361 7212 2206 5006 0.2562 0.2585
## 4 <= 34 860 175 685 8072 2381 5691 0.0692 0.2035
## 5 > 34 4355 706 3649 12427 3087 9340 0.3504 0.1621
## 6 Missing 0 0 0 12427 3087 9340 0.0000 NaN
## 7 Total 12427 3087 9340 NA NA NA 1.0000 0.2484
## BadRate Odds LnOdds WoE IV
## 1 0.6133 0.6304 -0.4613 0.6458 0.0404
## 2 0.6719 0.4883 -0.7169 0.3902 0.0399
## 3 0.7415 0.3486 -1.0539 0.0532 0.0007
## 4 0.7965 0.2555 -1.3646 -0.2575 0.0043
## 5 0.8379 0.1935 -1.6426 -0.5355 0.0867
## 6 NaN NaN NaN NaN NaN
## 7 0.7516 0.3305 -1.1071 0.0000 0.1720
Cut points for age:
## [1] 23 27 32 34
Values of the age variable were replaced with the WoE of each interval:
##
## -0.5355 -0.2575 0.0532 0.3902 0.6458
## 4355 860 3184 2978 1050
The equivalent transformation of these two variables was performed on the test sample. To do this correctly and avoid information leakage, the cut points and WoE values derived from the training data were reused on the test data.
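A sketch of that mapping for relative_wage; the age breaks c(-Inf, 23, 27, 32, 34, Inf) are handled identically, and the WoE values come from the tables above:

wage_breaks <- c(-Inf, 110.45, 139.65, Inf)
wage_woe    <- c(1.4484, 0.0940, -0.5742)   # per-bin WoE learned on the train sample

bin_idx <- cut(job_change_test$relative_wage, breaks = wage_breaks,
               labels = FALSE, right = TRUE)   # right = TRUE matches the "<=" cut points
job_change_test$relative_wage_WoE <- wage_woe[bin_idx]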
We further investigated the county variable. We noticed that it has 123 different values, some of which are very frequent and others are very rare.
##
## county_118 county_059 county_075 county_110 county_074 county_093 county_028
## 2825 1771 962 854 557 373 286
## county_020 county_117 county_119 county_022 county_121 county_024 county_112
## 204 200 193 190 168 167 165
## county_032 county_007 county_053 county_029 county_058 county_049 county_011
## 130 127 126 111 109 101 100
## county_038 county_068 county_108 county_062 county_040 county_073 county_030
## 96 93 90 83 81 79 75
## county_021 county_041 county_092 county_003 county_076 county_001 county_034
## 72 72 70 68 65 63 63
## county_122 county_045 county_072 county_082 county_099 county_120 county_102
## 62 60 60 60 60 54 52
## county_002 county_009 county_116 county_081 county_046 county_084 county_019
## 51 50 50 45 43 40 38
## county_087 county_109 county_057 county_080 county_094 county_097 county_025
## 38 37 35 33 32 32 31
## county_077 county_054 county_086 county_006 county_060 county_037 county_018
## 31 29 27 23 21 19 18
## county_026 county_085 county_017 county_066 county_088 county_101 county_090
## 18 18 17 17 17 17 16
## county_004 county_123 county_005 county_010 county_050 county_023 county_027
## 15 15 14 14 14 13 13
## county_044 county_078 county_106 county_014 county_052 county_008 county_042
## 13 13 13 12 12 11 11
## county_055 county_067 county_079 county_107 county_036 county_043 county_105
## 11 11 10 10 9 9 8
## county_113 county_033 county_035 county_039 county_048 county_070 county_115
## 8 7 7 7 7 7 7
## county_063 county_095 county_096 county_100 county_013 county_047 county_114
## 6 6 6 6 5 5 5
## county_016 county_031 county_061 county_064 county_015 county_056 county_065
## 4 4 4 4 3 3 3
## county_071 county_083 county_091 county_012 county_051 county_098 county_103
## 3 3 3 2 2 2 2
## county_104 county_069 county_089 county_111
## 2 1 1 1
Based on the plot, the elbow method suggests that the optimal number of components is 3.
The result therefore contains three components (PC1, PC2 and PC3) as new variables replacing “county”.
The analogous transformation (components based on information from the training dataset only) was applied to the test dataset.
We now have 3 variables for county instead of 123, which we expect to improve the performance of our model.
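The report does not echo the PCA code; one hypothetical way to build these components, assuming one-hot encoding of county and identical factor levels in both samples:

county_train <- model.matrix(~ county - 1, data = job_change_train)
county_pca   <- prcomp(county_train, center = TRUE, scale. = TRUE)

# keep the first three components; project the test data with the train loadings
train_pcs <- county_pca$x[, 1:3]
test_pcs  <- predict(county_pca,
                     model.matrix(~ county - 1, data = job_change_test))[, 1:3]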
Analysis of the remaining variables did not suggest any other meaningful transformation.
Our final training dataset contains 14 predictors plus the target variable; the test dataset contains the same 14 predictors:
## [1] "gender" "education" "field_of_studies"
## [4] "is_studying" "years_since_job_change" "years_of_experience"
## [7] "is_certified" "size_of_company" "type_of_company"
## [10] "willing_to_change_job" "relative_wage_WoE" "age_WoE"
## [13] "PC1" "PC2" "PC3"
## [1] "gender" "education" "field_of_studies"
## [4] "is_studying" "years_since_job_change" "years_of_experience"
## [7] "is_certified" "size_of_company" "type_of_company"
## [10] "relative_wage_WoE" "age_WoE" "PC1"
## [13] "PC2" "PC3"
## Accuracy Sensitivity Specificity Pos Pred Value
## 0.78627 0.37091 0.92355 0.61592
## Neg Pred Value F1 Balanced Accuracy
## 0.81624 0.46300 0.64723
Balanced accuracy was slightly better for the logit model, so we used it for the stepwise, backward and forward selection models. All of these were based on the Akaike Information Criterion (AIC).
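A sketch of the selection step with MASS::stepAIC (object names are ours):

full_logit <- glm(willing_to_change_job ~ ., data = job_change_train_final,
                  family = binomial(link = "logit"))
null_logit <- glm(willing_to_change_job ~ 1, data = job_change_train_final,
                  family = binomial(link = "logit"))

stepwise_model <- stepAIC(full_logit, direction = "both", trace = FALSE)
backward_model <- stepAIC(full_logit, direction = "backward", trace = FALSE)
forward_model  <- stepAIC(null_logit, direction = "forward", trace = FALSE,
                          scope = list(lower = null_logit, upper = full_logit))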
## Accuracy Sensitivity Specificity Pos Pred Value
## 0.78474 0.36702 0.92281 0.61111
## Neg Pred Value F1 Balanced Accuracy
## 0.81519 0.45861 0.64491
## Accuracy Sensitivity Specificity Pos Pred Value
## 0.78474 0.36702 0.92281 0.61111
## Neg Pred Value F1 Balanced Accuracy
## 0.81519 0.45861 0.64491
Both models have exactly the same metrics, so we check whether they selected identical sets of variables:
## [1] TRUE
It turns out that both models selected exactly the same variables. The forward selection model, whose metrics are shown below, matches the original logit model exactly:
## Accuracy Sensitivity Specificity Pos Pred Value
## 0.78627 0.37091 0.92355 0.61592
## Neg Pred Value F1 Balanced Accuracy
## 0.81624 0.46300 0.64723
We then proceed to KNN models. For every model, range standardization (min-max scaling) of the data was applied.
We start with a model using the rule of thumb k = sqrt(n):
## [1] 111.4765
We use k = 113, the first odd number larger than the square root of n.
## Accuracy Sensitivity Specificity Pos Pred Value
## 0.78466 0.36022 0.92495 0.61335
## Neg Pred Value F1 Balanced Accuracy
## 0.81393 0.45388 0.64258
Then we tune the number of neighbours through cross-validation, using 5-fold cross-validation repeated 3 times.
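A sketch of the tuning call (the k grid is our assumption):

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

job_change_train_knn_tuned <- train(willing_to_change_job ~ .,
                                    data = job_change_train_final,
                                    method = "knn",
                                    preProcess = "range",   # min-max standardization
                                    tuneGrid = data.frame(k = seq(5, 151, by = 2)),
                                    trControl = ctrl)
job_change_train_knn_tuned$bestTune

For the ROC-tuned variant below, trainControl() additionally needs classProbs = TRUE and summaryFunction = twoClassSummary, with metric = "ROC" passed to train(); for F1, a precision-recall summary such as caret's prSummary (whose "F" measure is the F1 score) can be used instead.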
The optimal number of neighbors was set to:
## [1] 45
## Accuracy Sensitivity Specificity Pos Pred Value
## 0.79561 0.42760 0.91724 0.63067
## Neg Pred Value F1 Balanced Accuracy
## 0.82901 0.50965 0.67242
We fit the same model again, this time using ROC as the cross-validation metric.
The optimal number of neighbors was set to:
## [1] 101
We repeat the process again, this time for the F1 metric.
The optimal number of neighbors was set to:
## [1] 45
## Accuracy Sensitivity Specificity Pos Pred Value
## 0.79440 0.42533 0.91638 0.62703
## Neg Pred Value F1 Balanced Accuracy
## 0.82832 0.50685 0.67086
Of all the KNN models, this one (together with the identically tuned accuracy variant) gives the best result for balanced accuracy.
At this point a very clear pattern emerges in our results: all the models combine very low sensitivity with very high specificity, which means they are very good at detecting negatives and poor at detecting positives.
We will try to address this issue after fitting all the remaining models.
We move on to penalized logistic regression, starting with ridge. Using 5-fold cross-validation repeated 3 times, the lambda parameter was set to:
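A sketch of the ridge fit: glmnet (pulled in by caret) with alpha fixed at 0; the lambda grid is our assumption:

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

job_change_ridge <- train(willing_to_change_job ~ .,
                          data = job_change_train_final,
                          method = "glmnet",
                          tuneGrid = expand.grid(alpha = 0,
                                                 lambda = exp(seq(-8, 0, length.out = 100))),
                          trControl = ctrl)

# fitted classes on the training sample, evaluated with confusionMatrix() below
job_change_ridge_fitted <- predict(job_change_ridge, newdata = job_change_train_final)
job_change_ridge$bestTune$lambda

The LASSO and elastic net fits that follow differ only in the grid: alpha = 1 for the LASSO, and a joint grid over alpha and lambda for the elastic net.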
## [1] 0.01382622
confusionMatrix(job_change_ridge_fitted,
job_change_train_final$willing_to_change_job,
positive='Yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 8713 2061
## Yes 627 1026
##
## Accuracy : 0.7837
## 95% CI : (0.7764, 0.7909)
## No Information Rate : 0.7516
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.3141
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.33236
## Specificity : 0.93287
## Pos Pred Value : 0.62069
## Neg Pred Value : 0.80871
## Prevalence : 0.24841
## Detection Rate : 0.08256
## Detection Prevalence : 0.13302
## Balanced Accuracy : 0.63262
##
## 'Positive' Class : Yes
##
We estimate this model again, this time using accuracy as the cross-validation metric.
Using 5-fold cross-validation repeated 3 times, the lambda parameter was set to:
## [1] 0.01588565
confusionMatrix(job_change_ridge_fitted_2,
job_change_train_final$willing_to_change_job,
positive='Yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 8713 2061
## Yes 627 1026
##
## Accuracy : 0.7837
## 95% CI : (0.7764, 0.7909)
## No Information Rate : 0.7516
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.3141
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.33236
## Specificity : 0.93287
## Pos Pred Value : 0.62069
## Neg Pred Value : 0.80871
## Prevalence : 0.24841
## Detection Rate : 0.08256
## Detection Prevalence : 0.13302
## Balanced Accuracy : 0.63262
##
## 'Positive' Class : Yes
##
Although lambda is slightly different, the confusion matrix and balanced accuracy are identical to those of the previous model.
For the LASSO model, using 5-fold cross-validation repeated 3 times, the lambda parameter was set to:
## [1] 0.002612675
confusionMatrix(job_change_lasso_fitted,
job_change_train_final$willing_to_change_job,
positive='Yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 8668 2008
## Yes 672 1079
##
## Accuracy : 0.7843
## 95% CI : (0.777, 0.7915)
## No Information Rate : 0.7516
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.3246
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.34953
## Specificity : 0.92805
## Pos Pred Value : 0.61622
## Neg Pred Value : 0.81191
## Prevalence : 0.24841
## Detection Rate : 0.08683
## Detection Prevalence : 0.14090
## Balanced Accuracy : 0.63879
##
## 'Positive' Class : Yes
##
We estimate this model again, this time using accuracy as the cross-validation metric.
Using 5-fold cross-validation repeated 3 times, the lambda parameter was set to:
## [1] 0.0009884959
confusionMatrix(job_change_lasso_fitted_2,
job_change_train_final$willing_to_change_job,
positive='Yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 8643 1974
## Yes 697 1113
##
## Accuracy : 0.7851
## 95% CI : (0.7777, 0.7923)
## No Information Rate : 0.7516
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.3319
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.36054
## Specificity : 0.92537
## Pos Pred Value : 0.61492
## Neg Pred Value : 0.81407
## Prevalence : 0.24841
## Detection Rate : 0.08956
## Detection Prevalence : 0.14565
## Balanced Accuracy : 0.64296
##
## 'Positive' Class : Yes
##
For the elastic net model, using 5-fold cross-validation repeated 3 times, the parameters were set to:
## alpha lambda
## 1924 1 0.002612675
confusionMatrix(job_change_elastic_net_fitted,
job_change_train_final$willing_to_change_job,
positive='Yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 8668 2008
## Yes 672 1079
##
## Accuracy : 0.7843
## 95% CI : (0.777, 0.7915)
## No Information Rate : 0.7516
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.3246
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.34953
## Specificity : 0.92805
## Pos Pred Value : 0.61622
## Neg Pred Value : 0.81191
## Prevalence : 0.24841
## Detection Rate : 0.08683
## Detection Prevalence : 0.14090
## Balanced Accuracy : 0.63879
##
## 'Positive' Class : Yes
##
Since the tuned alpha equals 1, this elastic net collapses to the LASSO, which is why the results above match the LASSO model exactly. We estimate the model again, this time using accuracy as the cross-validation metric.
Using 5-fold cross-validation repeated 3 times, the model parameters were set to:
## alpha lambda
## 1506 0.7777778 0.0002146141
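We then turn to support vector machines, fitted through caret with the kernlab backend loaded above. A sketch of the three kernels (tuning grids are our assumptions); the tuned cost of the linear kernel is printed below:

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

svm_linear <- train(willing_to_change_job ~ ., data = job_change_train_final,
                    method = "svmLinear",
                    tuneGrid = data.frame(C = c(0.001, 0.01, 0.1, 1)),
                    trControl = ctrl)
svm_poly   <- train(willing_to_change_job ~ ., data = job_change_train_final,
                    method = "svmPoly", trControl = ctrl)
svm_radial <- train(willing_to_change_job ~ ., data = job_change_train_final,
                    method = "svmRadial", trControl = ctrl)
svm_linear$bestTune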
## C
## 1 0.001
confusionMatrix(job_change_svm_linear_fitted_classes,
job_change_train_final$willing_to_change_job,
positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 8602 2054
## Yes 738 1033
##
## Accuracy : 0.7753
## 95% CI : (0.7679, 0.7826)
## No Information Rate : 0.7516
## P-Value [Acc > NIR] : 0.000000000319
##
## Kappa : 0.2982
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.33463
## Specificity : 0.92099
## Pos Pred Value : 0.58329
## Neg Pred Value : 0.80724
## Prevalence : 0.24841
## Detection Rate : 0.08313
## Detection Prevalence : 0.14251
## Balanced Accuracy : 0.62781
##
## 'Positive' Class : Yes
##
## degree scale C
## 1 2 1 0.001
confusionMatrix(job_change_svm_poly_fitted_classes,
job_change_train_final$willing_to_change_job,
positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 8471 1308
## Yes 869 1779
##
## Accuracy : 0.8248
## 95% CI : (0.818, 0.8315)
## No Information Rate : 0.7516
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.5074
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.5763
## Specificity : 0.9070
## Pos Pred Value : 0.6718
## Neg Pred Value : 0.8662
## Prevalence : 0.2484
## Detection Rate : 0.1432
## Detection Prevalence : 0.2131
## Balanced Accuracy : 0.7416
##
## 'Positive' Class : Yes
##
## Balanced Accuracy
## job_change_logit1_cv_ba 0.6472300
## job_change_probit1_cv_ba 0.6408500
## stepwise_model_logit_AIC_ba 0.6449100
## backward_model_logit_AIC_ba 0.6449100
## forward_model_logit_AIC_ba 0.6472300
## job_change_train_knn113_ba 0.6425300
## job_change_train_knn_tuned_ba 0.6718300
## job_change_train_knn_tuned_roc_ba 0.6452300
## job_change_train_knn_tuned_f1_ba 0.6718300
## job_change_ridge_ba 0.6326154
## job_change_ridge_2_ba 0.6326154
## job_change_lasso_ba 0.6387908
## job_change_lasso_2_ba 0.6429595
## job_change_elastic_net_ba 0.6387908
## job_change_elastic_net_2_ba 0.6456100
## svm_linear_ba 0.6278071
## svm_poly_ba 0.7416235
## svm_radial_ba 0.8157949
The highest balanced accuracy was obtained for the SVM model with the radial kernel.
Now we try to address the issue mentioned before. To obtain higher sensitivity at the cost of some specificity, we lower the cut-off point: we try several values below 0.5 and check which one gives the highest balanced accuracy.
We check the results for each model at six different cut-off points: 0.2, 0.25, 0.3, 0.35, 0.4 and 0.45, with 0.5 as the baseline:
The logit model example shows that lowering the cut-off point produces a substantial increase in sensitivity at a relatively small cost in specificity; balanced accuracy improved for every cut-off point below 0.5.
We perform this analysis analogously for all the remaining models (a sketch of the scan for one model follows):
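The helper function below is ours; predicted "Yes" probabilities are thresholded at each cut-off and balanced accuracy is recomputed:

cutoffs <- c(0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5)

ba_for_cutoff <- function(model, cutoff) {
  p_yes <- predict(model, newdata = job_change_train_final, type = "prob")[, "Yes"]
  pred  <- factor(ifelse(p_yes > cutoff, "Yes", "No"), levels = c("No", "Yes"))
  confusionMatrix(pred, job_change_train_final$willing_to_change_job,
                  positive = "Yes")$byClass["Balanced Accuracy"]
}

sapply(cutoffs, ba_for_cutoff, model = job_change_logit1_cv)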
## job_change_logit1_cv job_change_probit1_cv stepwise_model_logit_AIC
## 0.2 0.7574936 0.7565066 0.7560372
## 0.25 0.7656749 0.7641896 0.7663804
## 0.3 0.7601545 0.7620542 0.7619417
## 0.35 0.7515127 0.7504352 0.7508730
## 0.4 0.7244063 0.7189474 0.7254427
## 0.45 0.6861172 0.6809768 0.6862298
## 0.5 0.6472324 0.6408457 0.6449141
## backward_model_logit_AIC forward_model_logit_AIC job_change_train_knn113
## 0.2 0.7560372 0.7574936 0.7462790
## 0.25 0.7663804 0.7656749 0.7559770
## 0.3 0.7619417 0.7601545 0.7530070
## 0.35 0.7508730 0.7515127 0.7450323
## 0.4 0.7254427 0.7244063 0.7267768
## 0.45 0.6862298 0.6861172 0.6927854
## 0.5 0.6449141 0.6472324 0.6414468
## job_change_train_knn_tuned job_change_train_knn_tuned_roc
## 0.2 0.7517134 0.7469077
## 0.25 0.7585769 0.7562447
## 0.3 0.7575573 0.7533186
## 0.35 0.7472313 0.7464791
## 0.4 0.7324649 0.7270527
## 0.45 0.7114901 0.6947990
## 0.5 0.6677133 0.6434962
## job_change_train_knn_tuned_f1 job_change_ridge job_change_ridge_2
## 0.2 0.7517134 0.7557050 0.7557050
## 0.25 0.7585769 0.7648293 0.7648293
## 0.3 0.7575573 0.7621572 0.7621572
## 0.35 0.7472313 0.7476529 0.7476529
## 0.4 0.7324649 0.7112512 0.7112512
## 0.45 0.7114901 0.6720919 0.6720919
## 0.5 0.6677133 0.6326154 0.6326154
## job_change_lasso job_change_lasso_2 job_change_elastic_net
## 0.2 0.7545232 0.7574332 0.7545232
## 0.25 0.7635939 0.7656776 0.7635939
## 0.3 0.7612334 0.7609671 0.7612334
## 0.35 0.7505354 0.7508099 0.7505354
## 0.4 0.7194814 0.7216501 0.7194814
## 0.45 0.6839444 0.6866018 0.6839444
## 0.5 0.6387908 0.6429595 0.6387908
## job_change_elastic_net_2
## 0.2 0.7578683
## 0.25 0.7651903
## 0.3 0.7605896
## 0.35 0.7514056
## 0.4 0.7244599
## 0.45 0.6863890
## 0.5 0.6456100
The top 3 models with the highest balanced accuracy are:
## Model Max_Balanced_Accuracy Best_Cutoff
## 6 svm_radial_ba 0.8157949 no_cutoff
## 1 stepwise_model_logit_AIC 0.7663804 0.25
## 2 backward_model_logit_AIC 0.7663804 0.25
We can see that the optimal cut-off point (for the two models that use one) is 0.25, which is consistent with the share of positive cases in our imbalanced dataset:
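The share table below has the shape of janitor::tabyl output, e.g.:

tabyl(job_change_train_final$willing_to_change_job)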
## job_change_train_final$willing_to_change_job n percent
## No 9340 0.7515893
## Yes 3087 0.2484107
Based on the balanced accuracy measure, our final model is the SVM with the radial kernel and parameters sigma = 0.05 and C = 1.
The expected balanced accuracy on the test sample is:
## Balanced Accuracy
## 0.8157949