library(dplyr)
library(readr)
library(ggplot2)
library(caret)
library(tibble)
library(purrr)
library(corrplot)
library(DescTools)
library(Information)
library(smbinning)
library(lmtest)
library(MASS)
library(pROC)
library(kernlab)
library(viridis)
library(janitor)
job_change_train <- read_csv("job_change_train.csv")
job_change_test <- read_csv("job_change_test.csv")
Our dataset does not have any missing values. We start by setting appropriate variable types in both the training and the test sample; a code sketch follows the type table below.
## Variable Type
## id id numeric
## gender gender factor
## age age numeric
## education education factor
## field_of_studies field_of_studies factor
## is_studying is_studying factor
## county county factor
## relative_wage relative_wage numeric
## years_since_job_change years_since_job_change factor
## years_of_experience years_of_experience factor
## hours_of_training hours_of_training numeric
## is_certified is_certified factor
## size_of_company size_of_company factor
## type_of_company type_of_company factor
## willing_to_change_job willing_to_change_job factor
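The type-setting code is not echoed in the report; a minimal sketch consistent with the table above (the vector name fct_cols is ours):

fct_cols <- c("gender", "education", "field_of_studies", "is_studying",
              "county", "years_since_job_change", "years_of_experience",
              "is_certified", "size_of_company", "type_of_company",
              "willing_to_change_job")

# apply the same types to both samples; the remaining columns stay numeric
job_change_train <- job_change_train %>% mutate(across(all_of(fct_cols), as.factor))
job_change_test  <- job_change_test  %>% mutate(across(all_of(fct_cols), as.factor))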
First, we remove the “id” variable from both datasets, as it has no predictive value.
We calculate correlations between numeric variables.
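A sketch of both steps; per the interpretation below, the reported number is the correlation between relative_wage and age:

# dplyr:: prefix because MASS (loaded above) masks dplyr's select()
job_change_train <- job_change_train %>% dplyr::select(-id)
job_change_test  <- job_change_test  %>% dplyr::select(-id)

cor(job_change_train$relative_wage, job_change_train$age)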
## [1] 0.3203956
Relative wage is correlated with age, but the correlation is rather weak.
We also check the relationship between the numeric variables and the target variable using one-way ANOVA:
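A sketch of this check (object names are ours):

num_vars <- c("relative_wage", "age", "hours_of_training")

# one-way ANOVA of each numeric predictor against the binary target
anova_results <- map_dfr(num_vars, function(v) {
  s <- summary(aov(reformulate("willing_to_change_job", response = v),
                   data = job_change_train))[[1]]
  tibble(variable = v, F_stat = s$`F value`[1], p_value = s$`Pr(>F)`[1])
})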
## F_stat p_value
## relative_wage 1676.792861 0.000000e+00
## age 337.065279 2.660737e-74
## hours_of_training 3.524188 6.050284e-02
The F statistic for hours_of_training is very small and its p-value exceeds 0.05, so at the 5% significance level this variable is not useful for modelling the target variable. Since it does not help distinguish those willing to change jobs from those who are not, we remove it from both datasets.
We further investigate the relationship between the categorical variables and the target variable using the chi-squared test of independence:
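A sketch of the test loop (vector names are ours):

cat_vars <- c("gender", "education", "field_of_studies", "is_studying",
              "county", "years_since_job_change", "years_of_experience",
              "is_certified", "size_of_company", "type_of_company")

p_values <- sapply(cat_vars, function(v)
  chisq.test(table(job_change_train[[v]],
                   job_change_train$willing_to_change_job))$p.value)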
## p_values
## gender 4.682101e-17
## education 1.584596e-22
## field_of_studies 1.114671e-08
## is_studying 1.732990e-56
## county 0.000000e+00
## years_since_job_change 3.733758e-22
## years_of_experience 2.185853e-83
## is_certified 1.023978e-47
## size_of_company 3.669965e-154
## type_of_company 1.219730e-131
All of these variables have p-values far below 0.05, giving very strong evidence of a relationship with willingness to change jobs. For this reason, all of them remain in our dataset.
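The next table checks for zero and near-zero variance; it matches the saveMetrics output of caret's nearZeroVar, presumably from a call like:

nearZeroVar(job_change_train, saveMetrics = TRUE)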
## freqRatio percentUnique zeroVar nzv
## gender 2.873777 0.03218798 FALSE FALSE
## age 1.078780 0.24945683 FALSE FALSE
## education 2.659131 0.04828197 FALSE FALSE
## field_of_studies 5.111717 0.04828197 FALSE FALSE
## is_studying 3.786796 0.03218798 FALSE FALSE
## county 1.595144 0.98978032 FALSE FALSE
## relative_wage 1.909656 0.74837048 FALSE FALSE
## years_since_job_change 2.414419 0.05632896 FALSE FALSE
## years_of_experience 2.352298 0.18508087 FALSE FALSE
## is_certified 2.574058 0.01609399 FALSE FALSE
## size_of_company 1.917331 0.07242295 FALSE FALSE
## type_of_company 1.594994 0.05632896 FALSE FALSE
## willing_to_change_job 3.025591 0.01609399 FALSE FALSE
## willing_to_change_job_binary 3.025591 0.01609399 FALSE FALSE
This additional test for zero and near-zero variance of our variables came out negative for all of them.
We further decided to look closer into the relationship between the predictors and the target variable.
There is a clear difference in the distribution of wages between those willing and those not willing to change jobs. We therefore bin this variable, requiring a minimum of 5% of observations per bin.
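A sketch of the binning call with the smbinning package loaded above. smbinning() expects a plain data.frame and a 0/1 integer target (the binary column name appears in the variance table earlier); the "Yes" level is an assumption:

bin_df <- as.data.frame(job_change_train)
bin_df$willing_to_change_job_binary <-
  as.integer(bin_df$willing_to_change_job == "Yes")   # assumed positive level

wage_bin <- smbinning(bin_df, y = "willing_to_change_job_binary",
                      x = "relative_wage", p = 0.05)  # min. 5% of observations per bin
wage_bin$ivtable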
## Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec
## 1 <= 110.45 2260 1321 939 2260 1321 939 0.1819
## 2 <= 139.65 1558 415 1143 3818 1736 2082 0.1254
## 3 > 139.65 8609 1351 7258 12427 3087 9340 0.6928
## 4 Missing 0 0 0 12427 3087 9340 0.0000
## 5 Total 12427 3087 9340 NA NA NA 1.0000
## GoodRate BadRate Odds LnOdds WoE IV
## 1 0.5845 0.4155 1.4068 0.3413 1.4484 0.4742
## 2 0.2664 0.7336 0.3631 -1.0131 0.0940 0.0011
## 3 0.1569 0.8431 0.1861 -1.6813 -0.5742 0.1949
## 4 NaN NaN NaN NaN NaN NaN
## 5 0.2484 0.7516 0.3305 -1.1071 0.0000 0.6702
Cut points for relative_wage:
## [1] 110.45 139.65
Values of the relative_wage variable were replaced with the WoE of each interval:
##
## -0.5742 0.094 1.4484
## 8609 1558 2260
We perform the same analysis for age.
Although the difference is less pronounced, we can also see a gap in the distribution of age between those willing and those not willing to change jobs. For this reason we also binned this variable, using the same method as for relative_wage.
## Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate
## 1 <= 23 1050 406 644 1050 406 644 0.0845 0.3867
## 2 <= 27 2978 977 2001 4028 1383 2645 0.2396 0.3281
## 3 <= 32 3184 823 2361 7212 2206 5006 0.2562 0.2585
## 4 <= 34 860 175 685 8072 2381 5691 0.0692 0.2035
## 5 > 34 4355 706 3649 12427 3087 9340 0.3504 0.1621
## 6 Missing 0 0 0 12427 3087 9340 0.0000 NaN
## 7 Total 12427 3087 9340 NA NA NA 1.0000 0.2484
## BadRate Odds LnOdds WoE IV
## 1 0.6133 0.6304 -0.4613 0.6458 0.0404
## 2 0.6719 0.4883 -0.7169 0.3902 0.0399
## 3 0.7415 0.3486 -1.0539 0.0532 0.0007
## 4 0.7965 0.2555 -1.3646 -0.2575 0.0043
## 5 0.8379 0.1935 -1.6426 -0.5355 0.0867
## 6 NaN NaN NaN NaN NaN
## 7 0.7516 0.3305 -1.1071 0.0000 0.1720
Cut points for age:
## [1] 23 27 32 34
Values of the age variable were replaced with the WoE of each interval:
##
## -0.5355 -0.2575 0.0532 0.3902 0.6458
## 4355 860 3184 2978 1050
The equivalent transformation of these two variables was performed on the test sample. To do this correctly and avoid information leakage, the cut points and WoE values derived from the training data were reused on the test data.
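A sketch of that mapping for relative_wage; the age breaks c(-Inf, 23, 27, 32, 34, Inf) are handled identically, and the WoE values come from the tables above:

wage_breaks <- c(-Inf, 110.45, 139.65, Inf)
wage_woe    <- c(1.4484, 0.0940, -0.5742)   # per-bin WoE learned on the train sample

bin_idx <- cut(job_change_test$relative_wage, breaks = wage_breaks,
               labels = FALSE, right = TRUE)   # right = TRUE matches the "<=" cut points
job_change_test$relative_wage_WoE <- wage_woe[bin_idx]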
We further investigated the county variable. We noticed that it has 123 different values, some of which are very frequent and others are very rare.
##
## county_118 county_059 county_075 county_110 county_074 county_093 county_028
## 2825 1771 962 854 557 373 286
## county_020 county_117 county_119 county_022 county_121 county_024 county_112
## 204 200 193 190 168 167 165
## county_032 county_007 county_053 county_029 county_058 county_049 county_011
## 130 127 126 111 109 101 100
## county_038 county_068 county_108 county_062 county_040 county_073 county_030
## 96 93 90 83 81 79 75
## county_021 county_041 county_092 county_003 county_076 county_001 county_034
## 72 72 70 68 65 63 63
## county_122 county_045 county_072 county_082 county_099 county_120 county_102
## 62 60 60 60 60 54 52
## county_002 county_009 county_116 county_081 county_046 county_084 county_019
## 51 50 50 45 43 40 38
## county_087 county_109 county_057 county_080 county_094 county_097 county_025
## 38 37 35 33 32 32 31
## county_077 county_054 county_086 county_006 county_060 county_037 county_018
## 31 29 27 23 21 19 18
## county_026 county_085 county_017 county_066 county_088 county_101 county_090
## 18 18 17 17 17 17 16
## county_004 county_123 county_005 county_010 county_050 county_023 county_027
## 15 15 14 14 14 13 13
## county_044 county_078 county_106 county_014 county_052 county_008 county_042
## 13 13 13 12 12 11 11
## county_055 county_067 county_079 county_107 county_036 county_043 county_105
## 11 11 10 10 9 9 8
## county_113 county_033 county_035 county_039 county_048 county_070 county_115
## 8 7 7 7 7 7 7
## county_063 county_095 county_096 county_100 county_013 county_047 county_114
## 6 6 6 6 5 5 5
## county_016 county_031 county_061 county_064 county_015 county_056 county_065
## 4 4 4 4 3 3 3
## county_071 county_083 county_091 county_012 county_051 county_098 county_103
## 3 3 3 2 2 2 2
## county_104 county_069 county_089 county_111
## 2 1 1 1
Based on the plot, the elbow method suggests that the optimal number of components is 3.
The result therefore contains three components (PC1, PC2 and PC3) as new variables replacing “county”.
The analogous transformation (components based on information from the training dataset only) was applied to the test dataset.
We now have 3 variables for county instead of 123, which we expect to improve the performance of our model.
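The report does not echo the PCA code; one hypothetical way to build these components, assuming one-hot encoding of county and identical factor levels in both samples:

county_train <- model.matrix(~ county - 1, data = job_change_train)
county_pca   <- prcomp(county_train, center = TRUE, scale. = TRUE)

# keep the first three components; project the test data with the train loadings
train_pcs <- county_pca$x[, 1:3]
test_pcs  <- predict(county_pca,
                     model.matrix(~ county - 1, data = job_change_test))[, 1:3]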
Analysis of the remaining variables did not suggest any other meaningful transformation.
Our final training dataset contains 14 predictors plus the target variable; the test dataset contains the same 14 predictors:
## [1] "gender" "education" "field_of_studies"
## [4] "is_studying" "years_since_job_change" "years_of_experience"
## [7] "is_certified" "size_of_company" "type_of_company"
## [10] "willing_to_change_job" "relative_wage_WoE" "age_WoE"
## [13] "PC1" "PC2" "PC3"
## [1] "gender" "education" "field_of_studies"
## [4] "is_studying" "years_since_job_change" "years_of_experience"
## [7] "is_certified" "size_of_company" "type_of_company"
## [10] "relative_wage_WoE" "age_WoE" "PC1"
## [13] "PC2" "PC3"
## Accuracy Sensitivity Specificity Pos Pred Value
## 0.78627 0.37091 0.92355 0.61592
## Neg Pred Value F1 Balanced Accuracy
## 0.81624 0.46300 0.64723
Balanced accuracy was slightly better for the logit model, so we used it for the stepwise, backward and forward selection models. All of these were based on the Akaike Information Criterion (AIC).
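A sketch of the selection step with MASS::stepAIC (object names are ours):

full_logit <- glm(willing_to_change_job ~ ., data = job_change_train_final,
                  family = binomial(link = "logit"))
null_logit <- glm(willing_to_change_job ~ 1, data = job_change_train_final,
                  family = binomial(link = "logit"))

stepwise_model <- stepAIC(full_logit, direction = "both", trace = FALSE)
backward_model <- stepAIC(full_logit, direction = "backward", trace = FALSE)
forward_model  <- stepAIC(null_logit, direction = "forward", trace = FALSE,
                          scope = list(lower = null_logit, upper = full_logit))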
## Accuracy Sensitivity Specificity Pos Pred Value
## 0.78474 0.36702 0.92281 0.61111
## Neg Pred Value F1 Balanced Accuracy
## 0.81519 0.45861 0.64491
## Accuracy Sensitivity Specificity Pos Pred Value
## 0.78474 0.36702 0.92281 0.61111
## Neg Pred Value F1 Balanced Accuracy
## 0.81519 0.45861 0.64491
Both models have exactly the same metrics, so we check whether they selected identical sets of variables:
## [1] TRUE
It turns out that both models selected exactly the same variables. The forward selection model, whose metrics are shown below, matches the original logit model exactly:
## Accuracy Sensitivity Specificity Pos Pred Value
## 0.78627 0.37091 0.92355 0.61592
## Neg Pred Value F1 Balanced Accuracy
## 0.81624 0.46300 0.64723
We then proceed to KNN models. For every model, range standardization (min-max scaling) of the data was applied.
We start with a model using the rule of thumb k = sqrt(n):
## [1] 111.4765
We use k = 113, the first odd number larger than the square root of n.
## Accuracy Sensitivity Specificity Pos Pred Value
## 0.78466 0.36022 0.92495 0.61335
## Neg Pred Value F1 Balanced Accuracy
## 0.81393 0.45388 0.64258
Then we tune the number of neighbours through cross-validation, using 5-fold cross-validation repeated 3 times.
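A sketch of the tuning call (the k grid is our assumption):

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

job_change_train_knn_tuned <- train(willing_to_change_job ~ .,
                                    data = job_change_train_final,
                                    method = "knn",
                                    preProcess = "range",   # min-max standardization
                                    tuneGrid = data.frame(k = seq(5, 151, by = 2)),
                                    trControl = ctrl)
job_change_train_knn_tuned$bestTune

For the ROC-tuned variant below, trainControl() additionally needs classProbs = TRUE and summaryFunction = twoClassSummary, with metric = "ROC" passed to train(); for F1, a precision-recall summary such as caret's prSummary (whose "F" measure is the F1 score) can be used instead.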
The optimal number of neighbors was set to:
## [1] 45
## Accuracy Sensitivity Specificity Pos Pred Value
## 0.79561 0.42760 0.91724 0.63067
## Neg Pred Value F1 Balanced Accuracy
## 0.82901 0.50965 0.67242
We fit the same model again, this time using ROC as the cross-validation metric.
The optimal number of neighbors was set to:
## [1] 101
We repeat the process again, this time for the F1 metric.
The optimal number of neighbors was set to:
## [1] 45
## Accuracy Sensitivity Specificity Pos Pred Value
## 0.79440 0.42533 0.91638 0.62703
## Neg Pred Value F1 Balanced Accuracy
## 0.82832 0.50685 0.67086
Of all the KNN models, this one (together with the identically tuned accuracy variant) gives the best result for balanced accuracy.
At this point a very clear pattern emerges in our results: all the models combine very low sensitivity with very high specificity, which means they are very good at detecting negatives and poor at detecting positives.
We will try to address this issue after fitting all the remaining models.
We move on to penalized logistic regression, starting with ridge. Using 5-fold cross-validation repeated 3 times, the lambda parameter was set to:
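A sketch of the ridge fit: glmnet (pulled in by caret) with alpha fixed at 0; the lambda grid is our assumption:

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

job_change_ridge <- train(willing_to_change_job ~ .,
                          data = job_change_train_final,
                          method = "glmnet",
                          tuneGrid = expand.grid(alpha = 0,
                                                 lambda = exp(seq(-8, 0, length.out = 100))),
                          trControl = ctrl)

# fitted classes on the training sample, evaluated with confusionMatrix() below
job_change_ridge_fitted <- predict(job_change_ridge, newdata = job_change_train_final)
job_change_ridge$bestTune$lambda

The LASSO and elastic net fits that follow differ only in the grid: alpha = 1 for the LASSO, and a joint grid over alpha and lambda for the elastic net.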
## [1] 0.01382622
confusionMatrix(job_change_ridge_fitted,
job_change_train_final$willing_to_change_job,
positive='Yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 8713 2061
## Yes 627 1026
##
## Accuracy : 0.7837
## 95% CI : (0.7764, 0.7909)
## No Information Rate : 0.7516
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.3141
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.33236
## Specificity : 0.93287
## Pos Pred Value : 0.62069
## Neg Pred Value : 0.80871
## Prevalence : 0.24841
## Detection Rate : 0.08256
## Detection Prevalence : 0.13302
## Balanced Accuracy : 0.63262
##
## 'Positive' Class : Yes
##
We estimate this model again, this time using accuracy as the cross-validation metric.
Using 5-fold cross-validation repeated 3 times, the lambda parameter was set to:
## [1] 0.01588565
confusionMatrix(job_change_ridge_fitted_2,
job_change_train_final$willing_to_change_job,
positive='Yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 8713 2061
## Yes 627 1026
##
## Accuracy : 0.7837
## 95% CI : (0.7764, 0.7909)
## No Information Rate : 0.7516
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.3141
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.33236
## Specificity : 0.93287
## Pos Pred Value : 0.62069
## Neg Pred Value : 0.80871
## Prevalence : 0.24841
## Detection Rate : 0.08256
## Detection Prevalence : 0.13302
## Balanced Accuracy : 0.63262
##
## 'Positive' Class : Yes
##
Although lambda is slightly different, the confusion matrix and balanced accuracy are identical to those of the previous model.
For the LASSO model, using 5-fold cross-validation repeated 3 times, the lambda parameter was set to:
## [1] 0.002612675
confusionMatrix(job_change_lasso_fitted,
job_change_train_final$willing_to_change_job,
positive='Yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 8668 2008
## Yes 672 1079
##
## Accuracy : 0.7843
## 95% CI : (0.777, 0.7915)
## No Information Rate : 0.7516
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.3246
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.34953
## Specificity : 0.92805
## Pos Pred Value : 0.61622
## Neg Pred Value : 0.81191
## Prevalence : 0.24841
## Detection Rate : 0.08683
## Detection Prevalence : 0.14090
## Balanced Accuracy : 0.63879
##
## 'Positive' Class : Yes
##
We estimate this model again, this time using accuracy as the cross-validation metric.
Using 5-fold cross-validation repeated 3 times, the lambda parameter was set to:
## [1] 0.0009884959
confusionMatrix(job_change_lasso_fitted_2,
job_change_train_final$willing_to_change_job,
positive='Yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 8643 1974
## Yes 697 1113
##
## Accuracy : 0.7851
## 95% CI : (0.7777, 0.7923)
## No Information Rate : 0.7516
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.3319
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.36054
## Specificity : 0.92537
## Pos Pred Value : 0.61492
## Neg Pred Value : 0.81407
## Prevalence : 0.24841
## Detection Rate : 0.08956
## Detection Prevalence : 0.14565
## Balanced Accuracy : 0.64296
##
## 'Positive' Class : Yes
##
For the elastic net model, using 5-fold cross-validation repeated 3 times, the parameters were set to:
## alpha lambda
## 1924 1 0.002612675
confusionMatrix(job_change_elastic_net_fitted,
job_change_train_final$willing_to_change_job,
positive='Yes')
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 8668 2008
## Yes 672 1079
##
## Accuracy : 0.7843
## 95% CI : (0.777, 0.7915)
## No Information Rate : 0.7516
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.3246
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.34953
## Specificity : 0.92805
## Pos Pred Value : 0.61622
## Neg Pred Value : 0.81191
## Prevalence : 0.24841
## Detection Rate : 0.08683
## Detection Prevalence : 0.14090
## Balanced Accuracy : 0.63879
##
## 'Positive' Class : Yes
##
Since the tuned alpha equals 1, this elastic net collapses to the LASSO, which is why the results above match the LASSO model exactly. We estimate the model again, this time using accuracy as the cross-validation metric.
Using 5-fold cross-validation repeated 3 times, the model parameters were set to:
## alpha lambda
## 1506 0.7777778 0.0002146141
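We then turn to support vector machines, fitted through caret with the kernlab backend loaded above. A sketch of the three kernels (tuning grids are our assumptions); the tuned cost of the linear kernel is printed below:

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

svm_linear <- train(willing_to_change_job ~ ., data = job_change_train_final,
                    method = "svmLinear",
                    tuneGrid = data.frame(C = c(0.001, 0.01, 0.1, 1)),
                    trControl = ctrl)
svm_poly   <- train(willing_to_change_job ~ ., data = job_change_train_final,
                    method = "svmPoly", trControl = ctrl)
svm_radial <- train(willing_to_change_job ~ ., data = job_change_train_final,
                    method = "svmRadial", trControl = ctrl)
svm_linear$bestTune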
## C
## 1 0.001
confusionMatrix(job_change_svm_linear_fitted_classes,
job_change_train_final$willing_to_change_job,
positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 8602 2054
## Yes 738 1033
##
## Accuracy : 0.7753
## 95% CI : (0.7679, 0.7826)
## No Information Rate : 0.7516
## P-Value [Acc > NIR] : 0.000000000319
##
## Kappa : 0.2982
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.33463
## Specificity : 0.92099
## Pos Pred Value : 0.58329
## Neg Pred Value : 0.80724
## Prevalence : 0.24841
## Detection Rate : 0.08313
## Detection Prevalence : 0.14251
## Balanced Accuracy : 0.62781
##
## 'Positive' Class : Yes
##
## degree scale C
## 1 2 1 0.001
confusionMatrix(job_change_svm_poly_fitted_classes,
job_change_train_final$willing_to_change_job,
positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 8471 1308
## Yes 869 1779
##
## Accuracy : 0.8248
## 95% CI : (0.818, 0.8315)
## No Information Rate : 0.7516
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.5074
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.5763
## Specificity : 0.9070
## Pos Pred Value : 0.6718
## Neg Pred Value : 0.8662
## Prevalence : 0.2484
## Detection Rate : 0.1432
## Detection Prevalence : 0.2131
## Balanced Accuracy : 0.7416
##
## 'Positive' Class : Yes
##
## Balanced Accuracy
## job_change_logit1_cv_ba 0.6472300
## job_change_probit1_cv_ba 0.6408500
## stepwise_model_logit_AIC_ba 0.6449100
## backward_model_logit_AIC_ba 0.6449100
## forward_model_logit_AIC_ba 0.6472300
## job_change_train_knn113_ba 0.6425300
## job_change_train_knn_tuned_ba 0.6718300
## job_change_train_knn_tuned_roc_ba 0.6452300
## job_change_train_knn_tuned_f1_ba 0.6718300
## job_change_ridge_ba 0.6326154
## job_change_ridge_2_ba 0.6326154
## job_change_lasso_ba 0.6387908
## job_change_lasso_2_ba 0.6429595
## job_change_elastic_net_ba 0.6387908
## job_change_elastic_net_2_ba 0.6456100
## svm_linear_ba 0.6278071
## svm_poly_ba 0.7416235
## svm_radial_ba 0.8157949
The highest balanced accuracy was obtained for the SVM model with the radial kernel.
Now we try to address the issue mentioned before. To obtain higher sensitivity at the cost of some specificity, we lower the cut-off point: we try several values below 0.5 and check which one gives the highest balanced accuracy.
We check the results for each model at six different cut-off points: 0.2, 0.25, 0.3, 0.35, 0.4 and 0.45, with 0.5 as the baseline:
The logit model example shows that lowering the cut-off point produces a substantial increase in sensitivity at a relatively small cost in specificity; balanced accuracy improved for every cut-off point below 0.5.
We perform this analysis analogously for all the remaining models (a sketch of the scan for one model follows):
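The helper function below is ours; predicted "Yes" probabilities are thresholded at each cut-off and balanced accuracy is recomputed:

cutoffs <- c(0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5)

ba_for_cutoff <- function(model, cutoff) {
  p_yes <- predict(model, newdata = job_change_train_final, type = "prob")[, "Yes"]
  pred  <- factor(ifelse(p_yes > cutoff, "Yes", "No"), levels = c("No", "Yes"))
  confusionMatrix(pred, job_change_train_final$willing_to_change_job,
                  positive = "Yes")$byClass["Balanced Accuracy"]
}

sapply(cutoffs, ba_for_cutoff, model = job_change_logit1_cv)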
## job_change_logit1_cv job_change_probit1_cv stepwise_model_logit_AIC
## 0.2 0.7574936 0.7565066 0.7560372
## 0.25 0.7656749 0.7641896 0.7663804
## 0.3 0.7601545 0.7620542 0.7619417
## 0.35 0.7515127 0.7504352 0.7508730
## 0.4 0.7244063 0.7189474 0.7254427
## 0.45 0.6861172 0.6809768 0.6862298
## 0.5 0.6472324 0.6408457 0.6449141
## backward_model_logit_AIC forward_model_logit_AIC job_change_train_knn113
## 0.2 0.7560372 0.7574936 0.7462790
## 0.25 0.7663804 0.7656749 0.7559770
## 0.3 0.7619417 0.7601545 0.7530070
## 0.35 0.7508730 0.7515127 0.7450323
## 0.4 0.7254427 0.7244063 0.7267768
## 0.45 0.6862298 0.6861172 0.6927854
## 0.5 0.6449141 0.6472324 0.6414468
## job_change_train_knn_tuned job_change_train_knn_tuned_roc
## 0.2 0.7517134 0.7469077
## 0.25 0.7585769 0.7562447
## 0.3 0.7575573 0.7533186
## 0.35 0.7472313 0.7464791
## 0.4 0.7324649 0.7270527
## 0.45 0.7114901 0.6947990
## 0.5 0.6677133 0.6434962
## job_change_train_knn_tuned_f1 job_change_ridge job_change_ridge_2
## 0.2 0.7517134 0.7557050 0.7557050
## 0.25 0.7585769 0.7648293 0.7648293
## 0.3 0.7575573 0.7621572 0.7621572
## 0.35 0.7472313 0.7476529 0.7476529
## 0.4 0.7324649 0.7112512 0.7112512
## 0.45 0.7114901 0.6720919 0.6720919
## 0.5 0.6677133 0.6326154 0.6326154
## job_change_lasso job_change_lasso_2 job_change_elastic_net
## 0.2 0.7545232 0.7574332 0.7545232
## 0.25 0.7635939 0.7656776 0.7635939
## 0.3 0.7612334 0.7609671 0.7612334
## 0.35 0.7505354 0.7508099 0.7505354
## 0.4 0.7194814 0.7216501 0.7194814
## 0.45 0.6839444 0.6866018 0.6839444
## 0.5 0.6387908 0.6429595 0.6387908
## job_change_elastic_net_2
## 0.2 0.7578683
## 0.25 0.7651903
## 0.3 0.7605896
## 0.35 0.7514056
## 0.4 0.7244599
## 0.45 0.6863890
## 0.5 0.6456100
The top 3 models with the highest balanced accuracy are:
## Model Max_Balanced_Accuracy Best_Cutoff
## 6 svm_radial_ba 0.8157949 no_cutoff
## 1 stepwise_model_logit_AIC 0.7663804 0.25
## 2 backward_model_logit_AIC 0.7663804 0.25
We can see that the optimal cut-off point (for the two models that use one) is 0.25, which is consistent with the share of positive cases in our imbalanced dataset:
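The share table below has the shape of janitor::tabyl output, e.g.:

tabyl(job_change_train_final$willing_to_change_job)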
## job_change_train_final$willing_to_change_job n percent
## No 9340 0.7515893
## Yes 3087 0.2484107
Based on the balanced accuracy measure, our final model is the SVM with the radial kernel and parameters sigma = 0.05 and C = 1.
The expected balanced accuracy on the test sample is:
## Balanced Accuracy
## 0.8157949