Stroke Probability Prediction with Random Forests, XGBoost, CatBoost, LightGBM & MLP¶

Mateusz Kowalski & Aleksandra Ciesińska¶

2024/2025¶

DISCLAIMER: THE ORIGINAL DATASET WAS EXTENDED BY SOME ARTIFICIAL FEATURES TO MAKE THE TASK DIFFERENT FROM A KAGGLE COMPETITION. THUS, ALL THE FEATURES NAMED FEAT01-FEAT10 ARE TREATED AS IF THEY WERE INFORMATIVE AND POTENTIALLY USEFUL FOR PREDICTION!¶
In [214]:
# basic libraries
import pandas as pd
import math as m
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# models
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import ExtraTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

# saving & time gauge
import joblib
import time
 
# preprocessing
from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer
from sklearn.feature_selection import SelectKBest, f_classif, chi2
from imblearn.over_sampling import ADASYN

## cont
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer
## cat
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# grid
from sklearn.model_selection import GridSearchCV

# pipeline
from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

# metrics
from sklearn.metrics import classification_report, confusion_matrix, f1_score, balanced_accuracy_score, precision_score, recall_score, roc_auc_score
# other
import missingno as msno

Data preprocessing with ImbPipeline()¶

In [217]:
df = pd.read_csv("C:\\Users\\Mateusz\\Desktop\\Machine Learning II\\project\\c1.csv")
In [219]:
df
Out[219]:
id age avg_glucose_level bmi ever_married feat01 feat02 feat03 feat04 feat05 ... feat08 feat09 feat10 gender heart_disease hypertension Residence_type smoking_status stroke work_type
0 1 75.0 219.82 29.5 Yes 0.475089 0.595032 1.230383 0.845381 1.067904 ... 1.426921 1.454747 0.441987 Female 0 1 Rural formerly smoked 0 Self-employed
1 2 50.0 69.92 18.7 Yes 0.618836 0.432241 1.402145 1.191038 1.395761 ... 1.320794 0.708369 0.348898 Female 0 0 Urban formerly smoked 0 Self-employed
2 3 79.0 72.73 28.4 Yes 0.421711 0.629341 0.916465 1.107330 1.335406 ... 0.669458 0.696877 0.809689 Male 0 0 Rural never smoked 1 Private
3 4 3.0 78.24 16.2 No 0.482635 0.557748 0.762485 0.925121 0.975914 ... 0.420634 1.343314 0.444809 Male 0 0 Rural Unknown 0 children
4 5 53.0 196.25 24.9 Yes 0.435785 0.497572 0.743418 1.007523 0.648819 ... 0.648643 1.168722 0.391860 Female 1 1 Urban smokes 0 Private
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5354 5355 55.0 92.98 25.6 Yes 0.514424 0.512044 0.599711 0.527823 1.484436 ... 0.662278 1.126019 0.255485 Female 0 0 Rural never smoked 1 Self-employed
5355 5356 59.0 109.82 23.7 Yes 0.513911 0.603200 1.471257 0.971114 1.448692 ... 0.671674 1.347132 0.373458 Female 0 0 Urban never smoked 0 Private
5356 5357 31.0 125.38 24.4 Yes 0.458564 0.509621 1.552418 0.719406 0.506281 ... 1.140376 0.736226 0.616051 Female 0 0 Urban smokes 0 Private
5357 5358 62.0 74.32 34.0 Yes 0.311724 0.530730 1.641201 1.242290 1.468356 ... 0.996123 1.050264 0.175882 Female 0 1 Rural never smoked 0 Self-employed
5358 5359 79.0 201.38 31.1 Yes 0.454434 0.493910 0.725071 1.124918 1.487161 ... 0.395280 0.781069 0.398358 Female 1 0 Rural never smoked 0 Private

5359 rows × 22 columns

Categorical features distribution¶

In [222]:
continuous_columns = ["age", "avg_glucose_level", "bmi", "feat01", "feat02", "feat03", "feat04", "feat05", "feat06", "feat07", "feat08", "feat09", "feat10"]
categorical_columns = [col for col in df.columns if col not in continuous_columns and col != "stroke"] # I exclude "stroke"

for col in categorical_columns:
    print(f"Feature: {col}")
    print(df[col].value_counts())
    print("-" * 40)
Feature: id
id
1       1
3580    1
3578    1
3577    1
3576    1
       ..
1786    1
1785    1
1784    1
1783    1
5359    1
Name: count, Length: 5359, dtype: int64
----------------------------------------
Feature: ever_married
ever_married
Yes    3573
No     1786
Name: count, dtype: int64
----------------------------------------
Feature: gender
gender
Female    3135
Male      2223
Other        1
Name: count, dtype: int64
----------------------------------------
Feature: heart_disease
heart_disease
0    5036
1     323
Name: count, dtype: int64
----------------------------------------
Feature: hypertension
hypertension
0    4795
1     564
Name: count, dtype: int64
----------------------------------------
Feature: Residence_type
Residence_type
Urban    2731
Rural    2628
Name: count, dtype: int64
----------------------------------------
Feature: smoking_status
smoking_status
never smoked       1982
Unknown            1591
formerly smoked     955
smokes              831
Name: count, dtype: int64
----------------------------------------
Feature: work_type
work_type
Private          3074
Self-employed     884
Govt_job          690
children          689
Never_worked       22
Name: count, dtype: int64
----------------------------------------
In [387]:
# gender as "other" appears just once! I can easily remove that "outlier", there's no need to create a separate variable for that
df = df[df['gender'] != "Other"].copy()

print(f"Number of rows: {df.shape[0]}")
Number of rows: 5358
In [226]:
df.drop(['id'], axis=1, inplace=True) # irrelevant column

So far we have only taken a first look at the dataset: we dropped the ID column, which is redundant, and we removed the single gender = "Other" observation, as doing so costs us very little information. No separate variable will be created for this category - merging a single observation into one of the existing levels would be arbitrary, so simply deleting it seems to be the best approach.

We could also remove the "Never_worked" level, as it has only 22 occurrences, but we prefer not to lose that information. Again, there is no natural category to merge it with, so we leave it as it is.

Missing values and possibly imbalanced data¶

In [230]:
df.isnull().sum() # missing values for bmi
Out[230]:
age                    0
avg_glucose_level      0
bmi                  241
ever_married           0
feat01                 0
feat02                 0
feat03                 0
feat04                 0
feat05                 0
feat06                 0
feat07                 0
feat08                 0
feat09                 0
feat10                 0
gender                 0
heart_disease          0
hypertension           0
Residence_type         0
smoking_status         0
stroke                 0
work_type              0
dtype: int64
In [232]:
# checking for imbalanced data
stroke_counts = df['stroke'].value_counts()

stroke_percentage = stroke_counts / stroke_counts.sum() * 100

# the plot illustrating the problem
plt.figure(figsize=(6, 4))
ax = stroke_counts.plot(kind='bar', color=['purple', 'orchid'])

# % labels
for i, count in enumerate(stroke_counts):
    ax.text(i, count + 10, f'{stroke_percentage[i]:.1f}%', ha='center', fontsize=12)

plt.title('0s and 1s for "stroke" variable')
plt.xlabel('Value')
plt.ylabel('Number')
plt.xticks(rotation=0)
plt.show()
[Figure: bar chart of the 0/1 distribution of the "stroke" variable, with percentage labels]

The data is clearly imbalanced - to tackle this problem, Adaptive Synthetic Sampling (ADASYN) will be applied later. This is crucial: trained on such strongly imbalanced data, every model would simply predict "no stroke" with very high probability and would be practically useless. Oversampling seems to be the better approach - downsampling would cause substantial information loss, since only about 5% of observations are 1s and we would have to discard most of the majority class. ADASYN, chosen for this project, is a good option because it does not simply copy existing observations - it generates synthetic minority-class samples in the feature space of the original dataset. In a nutshell, it looks at how each minority-class sample is surrounded by majority-class neighbours and generates more synthetic samples around the minority observations that are harder to learn, which should improve classification performance.
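Below is a minimal, self-contained sketch of what ADASYN does, on toy data rather than our pipeline (the toy feature matrix and the ~5% positive rate are made up purely for illustration):

In [ ]:
# toy illustration of ADASYN: synthetic minority samples are generated in feature
# space, with more of them around minority points that are harder to learn
import numpy as np
from collections import Counter
from imblearn.over_sampling import ADASYN

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))                 # 1000 observations, 3 numeric features
y_toy = (rng.random(1000) < 0.05).astype(int)      # ~5% positives, similar to "stroke"

X_res, y_res = ADASYN(random_state=0).fit_resample(X_toy, y_toy)
print(Counter(y_toy), "->", Counter(y_res))        # classes are roughly balanced afterwards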

Train-test split¶

In [236]:
#X_names = [col for col in df.columns if col != 'stroke']
#y = df[['stroke']]
#X = df[X_names]
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123456789)
In [238]:
#joblib.dump(X_train, 'X_train.pkl')
In [240]:
#joblib.dump(X_test, 'X_test.pkl')
In [242]:
#joblib.dump(y_train, 'y_train.pkl')
In [244]:
#joblib.dump(y_test, 'y_test.pkl')
In [246]:
def load_datasets():
    X_train = joblib.load('X_train.pkl')
    X_test = joblib.load('X_test.pkl')
    y_train = joblib.load('y_train.pkl')
    y_test = joblib.load('y_test.pkl')
    
    print("Datasets loaded successfully!")
    print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
    print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")
    return X_train, X_test, y_train, y_test
print('')

In [248]:
X_train, X_test, y_train, y_test = load_datasets()
Datasets loaded successfully!
X_train shape: (4286, 20), y_train shape: (4286, 1)
X_test shape: (1072, 20), y_test shape: (1072, 1)

Preprocessor¶

In [251]:
# different approach towards numerical and categorical data

continuous_columns = ["age", "avg_glucose_level", "bmi", "feat01", "feat02", "feat03", "feat04", "feat05", "feat06", "feat07", "feat08", "feat09", "feat10"]

categorical_columns = [col for col in X_train.columns if col not in continuous_columns and col != "stroke"] # I exclude "stroke"

# optimal k for kNN imputation
n_nbs = m.ceil(np.sqrt(len(X_train[["bmi"]])))

# preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('imputer', KNNImputer(n_neighbors=n_nbs, weights="uniform", metric="nan_euclidean")),
            ('scaler', StandardScaler()), 
            ('feature_selection', SelectKBest(f_classif, k=5)) # keep exactly 5 continuous vars
        ]), continuous_columns),
        
        ('cat', Pipeline([
            ('encoder', OneHotEncoder(handle_unknown='ignore')),   # one-hot encoding
            ('feature_selection', SelectKBest(chi2, k=5))  # keep exactly 5 encoded categorical dummies
        ]), categorical_columns)
    ])
In [253]:
preprocessor.fit(X_train, y_train)
E:\anaconda\Lib\site-packages\sklearn\utils\validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
Out[253]:
ColumnTransformer(transformers=[('num',
                                 Pipeline(steps=[('imputer',
                                                  KNNImputer(n_neighbors=66)),
                                                 ('scaler', StandardScaler()),
                                                 ('feature_selection',
                                                  SelectKBest(k=5))]),
                                 ['age', 'avg_glucose_level', 'bmi', 'feat01',
                                  'feat02', 'feat03', 'feat04', 'feat05',
                                  'feat06', 'feat07', 'feat08', 'feat09',
                                  'feat10']),
                                ('cat',
                                 Pipeline(steps=[('encoder',
                                                  OneHotEncoder(handle_unknown='ignore')),
                                                 ('feature_selection',
                                                  SelectKBest(k=5,
                                                              score_func=<function chi2 at 0x000001CE84E9B740>))]),
                                 ['ever_married', 'gender', 'heart_disease',
                                  'hypertension', 'Residence_type',
                                  'smoking_status', 'work_type'])])

One-hot encoding¶

In [256]:
cat_pipeline = preprocessor.named_transformers_['cat']  # a part regarding categorical variables
encoder = cat_pipeline.named_steps['encoder']  # I'm interested in the encoding process

X_cat_encoded = encoder.transform(X_train[categorical_columns]).toarray()
encoded_cat_feature_names = encoder.get_feature_names_out(categorical_columns)

X_cat_encoded_df = pd.DataFrame(X_cat_encoded, columns=encoded_cat_feature_names)  # as data frame
X_cat_encoded_df.head() # neat table
Out[256]:
ever_married_No ever_married_Yes gender_Female gender_Male heart_disease_0 heart_disease_1 hypertension_0 hypertension_1 Residence_type_Rural Residence_type_Urban smoking_status_Unknown smoking_status_formerly smoked smoking_status_never smoked smoking_status_smokes work_type_Govt_job work_type_Never_worked work_type_Private work_type_Self-employed work_type_children
0 1.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 1.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
3 0.0 1.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0
4 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

Feature selection¶

As far as feature selection methods are concerned, the choice of method depends on the types of the target and explanatory variables - numeric or categorical. Here the target is categorical, so we use the ANOVA F statistic for the numeric features and the Chi^2 statistic for the categorical ones.

In [260]:
selector_cat = cat_pipeline.named_steps['feature_selection'] # now feature selection

chi2_scores = pd.DataFrame({
    'Feature': encoded_cat_feature_names,
    'Chi2-Score': selector_cat.scores_
}).sort_values(by='Chi2-Score', ascending=False) # chi^2 scores sorted in descending order

num_pipeline = preprocessor.named_transformers_['num']
selector_num = num_pipeline.named_steps['feature_selection'] # for num ones too

X_num_processed = num_pipeline[:-1].transform(X_train[continuous_columns])  # up until selection
X_num_selected = num_pipeline.transform(X_train[continuous_columns])  # after selection

f_scores = pd.DataFrame({
    'Feature': continuous_columns,
    'F-Score': selector_num.scores_
}).sort_values(by='F-Score', ascending=False) # getting Fs for numerical features

#print("\nChi2 Scores for categorical features:")
#print(chi2_scores)

#print("\nF-Scores for numerical features:")
#print(f_scores) --> I'll make a plot to show that
In [262]:
fig, axes = plt.subplots(2, 1, figsize=(10, 14), sharex=False) # two plots, one each row

# chi2 Scores
axes[0].barh(chi2_scores['Feature'], chi2_scores['Chi2-Score'], color='purple')
axes[0].set_title('Chi2 Scores for Categorical Features', fontsize=16)
axes[0].set_xlabel('Chi2 Score', fontsize=14)
axes[0].invert_yaxis()  # there're a lot of features so it's better to rotate the plot

# f-Scores
axes[1].barh(f_scores['Feature'], f_scores['F-Score'], color='orchid')
axes[1].set_title('F-Scores for Numerical Features', fontsize=16)
axes[1].set_xlabel('F Score', fontsize=14)
axes[1].invert_yaxis()

plt.tight_layout()
plt.show()
[Figure: Chi2 scores for categorical features (top) and F-scores for numerical features (bottom)]

When constructing the preprocessor we assumed that 5 categorical and 5 numerical features should remain after selection - too many would invite overfitting, too few would oversimplify the analysis and make the results unreliable. Having checked the values of the statistics, no improvement seems possible for the numerical features. The categorical ones look reasonable too - the pipeline does not keep every level of a single variable (ever_married), so all dummies with a score above 30 enter the model. We could perhaps tweak the model by adding the work_type information, but the number of features is already fairly large, so the decision is to leave it as it is.
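As a quick sanity check, the names of the features that actually survive SelectKBest can be listed with get_support(); this small sketch reuses the selector and encoder objects fitted above:

In [ ]:
# which 5 numerical features and 5 encoded categorical dummies were kept
selected_num = np.array(continuous_columns)[selector_num.get_support()]
selected_cat = np.array(encoded_cat_feature_names)[selector_cat.get_support()]

print("Numerical features kept:", list(selected_num))
print("Categorical dummies kept:", list(selected_cat))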

Random Forest¶

Random Forests are a popular choice in machine learning due to their simplicity and diversity. Multiple decision trees are constructed and then merged together to get accurate and stable predictions. Mirosław Mamczur, a specialist in the field, proposed a simple, perhaps a bit puerile example explaining the model on his machine learning blog. According to him, the whole procedure can be compared to an event during which a group of 100 cooks try to figure out a perfect soup recipe. Each of them (just like decision trees) adds something special to the dish, creating a super-tasty meal (random forest, way better than the trees alone).

As with any machine learning algorithm, Random Forests require hyperparameter tuning. In this project we focused on six of them:

  • bootstrap: determines whether bootstrap sampling is used when building individual decision trees. False was ultimately chosen during cross-validation, so samples were drawn without replacement - we have to be careful about overfitting in this scenario, as bootstrap sampling helps to reduce it; that said, the results shown below do not seem to indicate any overfitting.
  • max_depth: the maximum depth of each decision tree; each split increases the depth by one. A maximum depth of 100 was chosen during validation, which seems quite large, although we also considered leaving this hyperparameter uncapped.
  • max_features: the size of the random subset of features considered when splitting a node. We allowed two values: "auto" and "sqrt" (the square root of the number of features). Note that for RandomForestClassifier "auto" is effectively equivalent to "sqrt" (and has been deprecated in newer scikit-learn versions), so the two grid options largely coincide; "auto" was the value returned by validation.
  • min_samples_leaf: the minimum number of samples required to be at a leaf node. Larger values create more generalized models by reducing tree branching, preventing overfitting - the lowest, default value equal to 1 was chosen during validation.
  • min_samples_split: the minimum number of samples required to split an internal node. It controls tree growth as higher values prevent overfitting by enforcing splits only when there are sufficient samples. The optimal value was actually 2, while 5 was an alternative.
  • n_estimators: the number of decision trees used; more trees generally improve performance, but with increased computation time. We decided not to include the default value of 100 in the grid, as it could be insufficient. We started with 200 which was chosen to be the best, with alternatives increasing up to 2000.

Later on we will make some final remarks on this subsection.

In [267]:
# pipeline; only the classifier is going to be changed later when we apply other methods
'''
pipe = ImbPipeline([
    ('preprocessor', preprocessor),
    ('resampler', ADASYN()), # the data needs resampling (about 95% of 0s)
    ('classifier', RandomForestClassifier()) # classifier; to be changed later
])

# grid
param_grid_rf = {
    'classifier__bootstrap': [True, False],
    'classifier__max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    'classifier__max_features': ['auto', 'sqrt'],
    'classifier__min_samples_leaf': [1, 2, 4],
    'classifier__min_samples_split': [2, 5],
    'classifier__n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
}

# 3-fold cross-validation (later on 5-fold CV is going to be implemented)
grid_search_rf = GridSearchCV(pipe, param_grid_rf, cv=3, verbose=2, n_jobs=-1)
grid_search_rf.fit(X_train, y_train)

# results
print(f"Best parameters: {grid_search_rf.best_params_}")
print(f"Best score: {grid_search_rf.best_score_}")
test_score = grid_search_rf.score(X_test, y_test)
print(f"Test score: {test_score}")
'''
# Fitting 3 folds for each of 2640 candidates, totalling 7920 fits
print('')

In [269]:
grid_search_rf = joblib.load("grid_search_rf.pkl")
In [271]:
# the best model
best_rf_model = grid_search_rf.best_estimator_
print("Best hyperparameters for Random Forest:", grid_search_rf.best_params_)
y_pred_rf = best_rf_model.predict(X_test)
y_pred_prob_rf = best_rf_model.predict_proba(X_test)[:, 1]
Best hyperparameters for Random Forest: {'classifier__bootstrap': False, 'classifier__max_depth': 100, 'classifier__max_features': 'auto', 'classifier__min_samples_leaf': 1, 'classifier__min_samples_split': 2, 'classifier__n_estimators': 200}
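As an optional side check (a sketch, not run here), the impurity-based feature importances of the tuned forest could be read directly from the fitted pipeline; the order of the values follows the preprocessor output, i.e. the 5 selected numerical features followed by the 5 selected categorical dummies:

In [ ]:
# optional: importances of the tuned Random Forest inside the ImbPipeline
rf_clf = best_rf_model.named_steps['classifier']
print(rf_clf.feature_importances_)   # 10 values: 5 numeric + 5 encoded categorical features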
In [273]:
# cutoff values
cutoffs = np.arange(0.5, 0.95, 0.05)

# I'd like a 3x3 canvas
fig, axes = plt.subplots(3, 3, figsize=(15, 15))
fig.suptitle('Confusion Matrices for Different Cutoff Values (Random Forest)', fontsize=16)

# confusion matrices and reports for each cutoff
for idx, cutoff in enumerate(cutoffs):
    y_pred_prob_rf = best_rf_model.predict_proba(X_test)[:, 1]
    row, col = divmod(idx, 3)
    y_pred_cutoff = (y_pred_prob_rf >= cutoff).astype(int)
    cm = confusion_matrix(y_test, y_pred_cutoff)

    # confusion matrix
    sns.heatmap(cm, annot=True, fmt='d', cmap='Purples', ax=axes[row, col], 
                xticklabels=['No Stroke', 'Stroke'], yticklabels=['No Stroke', 'Stroke'])
    
    axes[row, col].set_title(f'Cutoff = {cutoff:.2f}')
    axes[row, col].set_xlabel('Predicted')
    axes[row, col].set_ylabel('Actual')

# layout adjustment
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()
[Figure: confusion matrices for cutoffs 0.50-0.90 (Random Forest)]
In [274]:
# classification report and metrics basic loop
'''
for idx, cutoff in enumerate(cutoffs):
    y_pred_cutoff = (y_pred_prob_rf >= cutoff).astype(int)
    print(f"\nClassification Report for Cutoff = {cutoff:.2f}:")
    report = classification_report(y_test, y_pred_cutoff)
    print(report)

    f1 = f1_score(y_test, y_pred_cutoff)
    balanced_acc = balanced_accuracy_score(y_test, y_pred_cutoff)
    precision = precision_score(y_test, y_pred_cutoff)
    recall = recall_score(y_test, y_pred_cutoff)
    roc_auc = roc_auc_score(y_test, y_pred_prob_rf)

    print("Metrics:")
    print(f"F1 Score: {f1:.3f}")
    print(f"Balanced Accuracy: {balanced_acc:.3f}")
    print(f"Precision: {precision:.3f}")
    print(f"Recall: {recall:.3f}")
    print(f"AUC: {roc_auc:.3f}")
'''
print('Since this loop produced a very long, unaesthetic output, it is replaced by a neater layout created with ChatGPT. The basic loop is omitted for the models that follow.')
Since this loop produced a very long, unaesthetic output, it is replaced by a neater layout created with ChatGPT. The basic loop is omitted for the models that follow.
In [275]:
# neat layout by ChatGPT
reports = []

for cutoff in cutoffs:
    y_pred_cutoff = (y_pred_prob_rf >= cutoff).astype(int)

    report = classification_report(y_test, y_pred_cutoff)

    f1 = f1_score(y_test, y_pred_cutoff)
    balanced_acc = balanced_accuracy_score(y_test, y_pred_cutoff)
    precision = precision_score(y_test, y_pred_cutoff)
    recall = recall_score(y_test, y_pred_cutoff)
    roc_auc = roc_auc_score(y_test, y_pred_prob_rf)

    full_report = f"Cutoff = {cutoff:.2f}\n\n{report}\n" \
                  f"F1 Score: {f1:.3f}\n" \
                  f"Balanced Accuracy: {balanced_acc:.3f}\n" \
                  f"Precision: {precision:.3f}\n" \
                  f"Recall: {recall:.3f}\n" \
                  f"AUC: {roc_auc:.3f}"
    reports.append(full_report)

rows, cols = 3, 3

while len(reports) < rows * cols:
    reports.append("")  

table = [reports[i:i + cols] for i in range(0, len(reports), cols)]

fig, axes = plt.subplots(rows, cols, figsize=(15, 15))
for i in range(rows):
    for j in range(cols):
        ax = axes[i, j]
        ax.axis("off")
        ax.text(0.5, 0.5, table[i][j], ha="center", va="center", wrap=True, fontsize=10)

plt.tight_layout()
plt.show()
[Figure: classification reports and metrics for each cutoff (Random Forest)]
In [276]:
# joblib.dump(grid_search_rf, "grid_search_rf.pkl")

As this is the simplest algorithm we used, and we already know the other results, the conclusion here will be rather brief. First of all, the tuning process in several cases selected values that might have caused overfitting; however, the results do not seem to show that.

All the models turned out to be subpar, the best ones having cut-off points between 0.5 and 0.6. Balanced accuracy revolves around 65%, F1 and precision around 40%, and recall lies between 30% and 40%. There is definitely much room for improvement. The prediction of 0s looks perfect, but that is nothing exceptional when the dataset is so imbalanced and 0 is the majority class. The main challenge is to improve the prediction of 1s.
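For reference, the metrics quoted here and for all subsequent models are the standard ones, computed from the confusion-matrix counts TP, FP, TN and FN:

$$\text{Recall (TPR)} = \frac{TP}{TP+FN}, \qquad \text{Precision} = \frac{TP}{TP+FP}, \qquad F_1 = \frac{2\,\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}, \qquad \text{Balanced accuracy} = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right)$$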

XGBoost¶

The term “gradient boosting” comes from the idea of “boosting” or improving a single weak model by combining it with a number of other weak models in order to generate a collectively strong model.

In particular, XGBoost uses second-order gradients of the loss function in addition to the first-order gradients, based on a Taylor expansion of the loss. On top of that, XGBoost extends the loss into a more sophisticated objective function containing regularisation terms, which penalise adding new decision-tree leaves with a penalty proportional to the size of the leaf weights. This inhibits the growth of the model in order to prevent overfitting.
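Concretely, using the standard formulation from the XGBoost paper (with $g_i$ and $h_i$ denoting the first- and second-order gradients of the loss at the current prediction), the objective minimised when adding tree $f_t$ at round $t$ is approximately

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n}\left[g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2\right] + \Omega(f_t), \qquad \Omega(f) = \gamma\, T + \tfrac{1}{2}\,\lambda\, \lVert w \rVert^2,$$

where $T$ is the number of leaves and $w$ the vector of leaf weights; $\gamma$ is exactly the per-leaf penalty tuned below as the gamma hyperparameter, and $\lambda$ corresponds to the L2 term.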

We have optimized the following hyperparameters:

  • n_estimators: defined as previously; 500 was chosen, so a value in the middle.
  • learning_rate: a lower learning rate makes the model learn more slowly and thus it requires more trees to converge. Higher values may lead to overfitting. The value of 0.05 was chosen, so something in the middle of considered values.
  • max_depth: defined as previously; 9 was chosen, so a value in the middle.
  • min_child_weight: the minimum sum of instance weights (hessian) required in a child node, playing a role similar to min_samples_leaf earlier; 1 was chosen.
  • subsample: a fraction of samples to be used for training each tree. A lower value makes the model more regularised and prevents overfitting, however the cross-validation process suggested the choice of 100%.
  • colsample_bytree: the fraction of features to randomly sample for each tree. 60% was chosen during cross-validation. We must note that lower values may result in underfitting if the model might not have enough information to make strong splits.
  • gamma: the minimum loss reduction required to make a further partition on a leaf node; a value of 0 means no regularization on the splits.

~ Inspired by the article: https://developer.nvidia.com/blog/gradient-boosting-decision-trees-xgboost-cuda/

In [280]:
'''
pipe = ImbPipeline([
    ('preprocessor', preprocessor),
    ('resampler', ADASYN()),
    ('classifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss')) # XGB now
])

param_grid_xgb = {
    'classifier__n_estimators': [100, 300, 500, 700],
    'classifier__learning_rate': [0.01, 0.05, 0.1, 0.2],
    'classifier__max_depth': [3, 6, 9, 12],
    'classifier__min_child_weight': [1, 3, 5],
    'classifier__subsample': [0.6, 0.8, 1.0], # exclude 1 during the next attempt
    'classifier__colsample_bytree': [0.6, 0.8, 1.0], # exclude 1 during the next attempt
    'classifier__gamma': [0, 1, 5]   
}

# 5-fold CV
grid_search_xgb = GridSearchCV(
    pipe, 
    param_grid_xgb, 
    cv=5,
    verbose=2, 
    n_jobs=-1,
    scoring='accuracy'
)

# results
grid_search_xgb.fit(X_train, y_train)
print(f"Best parameters: {grid_search_xgb.best_params_}")
print(f"Best validation result: {grid_search_xgb.best_score_}")
'''
# Fitting 5 folds for each of 5184 candidates, totalling 25920 fits
print('')

In [281]:
grid_search_xgb = joblib.load("grid_search_xgb.pkl")
In [282]:
# the best model
best_xgb_model = grid_search_xgb.best_estimator_
print("Best hyperparameters for XGBoost:", grid_search_xgb.best_params_)
y_pred_xgb = best_xgb_model.predict(X_test)
y_pred_prob_xgb = best_xgb_model.predict_proba(X_test)[:, 1]
Best hyperparameters for XGBoost: {'classifier__colsample_bytree': 0.6, 'classifier__gamma': 0, 'classifier__learning_rate': 0.05, 'classifier__max_depth': 9, 'classifier__min_child_weight': 1, 'classifier__n_estimators': 500, 'classifier__subsample': 1.0}
In [283]:
# cutoff values
cutoffs = np.arange(0.5, 0.95, 0.05)

# I'd like a 3x3 canvas
fig, axes = plt.subplots(3, 3, figsize=(15, 15))
fig.suptitle('Confusion Matrices for Different Cutoff Values (XGBoost)', fontsize=16)

# confusion matrices and reports for each cutoff
for idx, cutoff in enumerate(cutoffs):
    y_pred_prob_xgb = best_xgb_model.predict_proba(X_test)[:, 1]
    row, col = divmod(idx, 3)
    y_pred_cutoff = (y_pred_prob_xgb >= cutoff).astype(int)
    cm = confusion_matrix(y_test, y_pred_cutoff)

    # plotting confusion matrix
    sns.heatmap(cm, annot=True, fmt='d', cmap='Purples', ax=axes[row, col], 
                xticklabels=['No Stroke', 'Stroke'], yticklabels=['No Stroke', 'Stroke'])
    
    axes[row, col].set_title(f'Cutoff = {cutoff:.2f}')
    axes[row, col].set_xlabel('Predicted')
    axes[row, col].set_ylabel('Actual')

# layout adjustment
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()
[Figure: confusion matrices for cutoffs 0.50-0.90 (XGBoost)]
In [284]:
# neat layout by ChatGPT
reports = []

for cutoff in cutoffs:
    y_pred_cutoff = (y_pred_prob_xgb >= cutoff).astype(int)

    report = classification_report(y_test, y_pred_cutoff)

    f1 = f1_score(y_test, y_pred_cutoff)
    balanced_acc = balanced_accuracy_score(y_test, y_pred_cutoff)
    precision = precision_score(y_test, y_pred_cutoff)
    recall = recall_score(y_test, y_pred_cutoff)
    roc_auc = roc_auc_score(y_test, y_pred_prob_xgb)

    full_report = f"Cutoff = {cutoff:.2f}\n\n{report}\n" \
                  f"F1 Score: {f1:.3f}\n" \
                  f"Balanced Accuracy: {balanced_acc:.3f}\n" \
                  f"Precision: {precision:.3f}\n" \
                  f"Recall: {recall:.3f}\n" \
                  f"AUC: {roc_auc:.3f}"
    reports.append(full_report)

rows, cols = 3, 3

while len(reports) < rows * cols:
    reports.append("")  

table = [reports[i:i + cols] for i in range(0, len(reports), cols)]

fig, axes = plt.subplots(rows, cols, figsize=(15, 15))
for i in range(rows):
    for j in range(cols):
        ax = axes[i, j]
        ax.axis("off")
        ax.text(0.5, 0.5, table[i][j], ha="center", va="center", wrap=True, fontsize=10)

plt.tight_layout()
plt.show()
[Figure: classification reports and metrics for each cutoff (XGBoost)]
In [285]:
# joblib.dump(grid_search_xgb, "grid_search_xgb.pkl")

To sum up this subsection: first of all, we were slightly worried by the three tuned hyperparameters at the bottom of the grid - the whole sample was used for training each tree (subsample = 1.0), a relatively low fraction of features was sampled for each tree (colsample_bytree = 0.6), and gamma was set to 0. We would be more confident if this part of the final configuration looked different.

The results we obtained are only slightly better than before - again the best XGBoost models are the ones with cut-off points between 0.5 and 0.6. Balanced accuracy revolves around 67%, F1 around 40%, and precision rises from 40% to 50% as the cut-off increases; recall moves from 35% to 40%. There is definitely still room for improvement and, as previously, the main challenge is to improve the prediction of 1s.

CatBoost¶

CatBoost is yet another example of gradient boosting. Unlike other machine learning algorithms that require categorical variables to be converted into numerical format, CatBoost allows us to omit that step. However, as we have a pipeline to preprocess the data for all models very quickly, we will not benefit from that.
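For illustration only (we do not use this in our pipeline), CatBoost could consume the raw string-valued columns directly via its cat_features argument - a minimal, hypothetical sketch:

In [ ]:
# hypothetical direct use of CatBoost's native categorical handling (not our pipeline);
# the column list is just an example of the string-valued features in this dataset
cat_cols = ['ever_married', 'gender', 'Residence_type', 'smoking_status', 'work_type']
native_cb = CatBoostClassifier(iterations=200, depth=6, silent=True, cat_features=cat_cols)
# native_cb.fit(X_train, y_train.values.ravel())   # raw categories, no one-hot encoding needed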

According to A. Kolli, there are 3 main advantages of CatBoost:

  1. Handling Categorical Features: already explained.

  2. Ordered Boosting:

One of the core innovations of CatBoost is its ordered boosting mechanism. Traditional gradient boosting methods can suffer from prediction shift due to the overlap between the training data for the base models and the data used to calculate the gradients. CatBoost addresses this by introducing a random permutation of the dataset in each iteration and using only the data before each example in the permutation for training. This approach reduces overfitting and improves model robustness.

  3. Symmetric Trees:

CatBoost builds balanced trees, also known as symmetric trees, as its base predictors. Unlike traditional gradient boosting methods that build trees leaf-wise or depth-wise, CatBoost’s symmetric trees ensure that all leaf nodes at the same level share the same decision rule. This leads to faster execution and reduces the likelihood of overfitting.

We do agree with that. The execution of the code was efficient and as shown later, the results were better than before. Let's just focus on the hyperparameters tuned in this model:

  • iterations: as n_estimators before; the number of trees chosen now was 500.
  • depth: as max_depth before; the value chosen was only 2, however it is normal for CatBoost to have this parameter between 2 and 10. Usually, the value chosen is 6.
  • learning_rate: explained before; here 0.01 was suggested after cross-validation.
  • l2_leaf_reg: L2 regularization coefficient applied to the leaf nodes of the trees (helping to penalize large weights in the leaf nodes). The final value was 1, so not much. Usually the parameter is set between 1 and 30.
  • border_count: the number of discrete bins used for continuous features, as CatBoost discretises continuous values into intervals before building splits; 32 bins were chosen.

~ Inspired by: https://aravindkolli.medium.com/understanding-catboost-the-gradient-boosting-algorithm-for-categorical-data-73ddb200895d

In [291]:
'''
pipe_cb = ImbPipeline([
    ('preprocessor', preprocessor),
    ('resampler', ADASYN()),
    ('classifier', CatBoostClassifier(silent=True))
])

param_grid_cb = {
    'classifier__iterations': [100, 200, 500],
    'classifier__depth': [2, 6, 8, 10],
    'classifier__learning_rate': [0.01, 0.05, 0.1, 0.2],
    'classifier__l2_leaf_reg': [1, 3, 5, 7],
    'classifier__border_count': [32, 64, 128]
}

grid_search_cb = GridSearchCV(
    estimator=pipe_cb,
    param_grid=param_grid_cb,
    scoring='roc_auc',
    cv=5,
    n_jobs=-1,
    verbose=2
)

grid_search_cb.fit(X_train, y_train.values.ravel())  
'''
print('')

In [292]:
grid_search_cb = joblib.load("grid_search_cb.pkl")
In [293]:
# the best model
best_cb_model = grid_search_cb.best_estimator_
print("Best hyperparameters for CatBoost:", grid_search_cb.best_params_)
y_pred_cb = best_cb_model.predict(X_test)
y_pred_prob_cb = best_cb_model.predict_proba(X_test)[:, 1]
Best hyperparameters for CatBoost: {'classifier__border_count': 32, 'classifier__depth': 2, 'classifier__iterations': 500, 'classifier__l2_leaf_reg': 1, 'classifier__learning_rate': 0.01}
In [294]:
# cutoffs between 0.5 and 0.9 are to be tested
cutoffs = np.arange(0.5, 0.95, 0.05)

# I'd like a 3x3 canvas
fig, axes = plt.subplots(3, 3, figsize=(15, 15))
fig.suptitle('Confusion Matrices for Different Cutoff Values (CatBoost)', fontsize=16)

# confusion matrices and reports for each cutoff
for idx, cutoff in enumerate(cutoffs):
    y_pred_prob_cb = best_cb_model.predict_proba(X_test)[:, 1]
    row, col = divmod(idx, 3)
    y_pred_cutoff = (y_pred_prob_cb >= cutoff).astype(int)
    cm = confusion_matrix(y_test, y_pred_cutoff)

    # plotting the confusion matrix
    sns.heatmap(cm, annot=True, fmt='d', cmap='Purples', ax=axes[row, col], 
                xticklabels=['No Stroke', 'Stroke'], yticklabels=['No Stroke', 'Stroke'])
    
    axes[row, col].set_title(f'Cutoff = {cutoff:.2f}')
    axes[row, col].set_xlabel('Predicted')
    axes[row, col].set_ylabel('Actual')

# plot adjustment
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()
[Figure: confusion matrices for cutoffs 0.50-0.90 (CatBoost)]
In [295]:
# neat layout by ChatGPT
reports = []

for cutoff in cutoffs:
    y_pred_cutoff = (y_pred_prob_cb >= cutoff).astype(int)

    report = classification_report(y_test, y_pred_cutoff)

    f1 = f1_score(y_test, y_pred_cutoff)
    balanced_acc = balanced_accuracy_score(y_test, y_pred_cutoff)
    precision = precision_score(y_test, y_pred_cutoff)
    recall = recall_score(y_test, y_pred_cutoff)
    roc_auc = roc_auc_score(y_test, y_pred_prob_cb)

    full_report = f"Cutoff = {cutoff:.2f}\n\n{report}\n" \
                  f"F1 Score: {f1:.3f}\n" \
                  f"Balanced Accuracy: {balanced_acc:.3f}\n" \
                  f"Precision: {precision:.3f}\n" \
                  f"Recall: {recall:.3f}\n" \
                  f"AUC: {roc_auc:.3f}"
    reports.append(full_report)

rows, cols = 3, 3

while len(reports) < rows * cols:
    reports.append("")  

table = [reports[i:i + cols] for i in range(0, len(reports), cols)]

fig, axes = plt.subplots(rows, cols, figsize=(15, 15))
for i in range(rows):
    for j in range(cols):
        ax = axes[i, j]
        ax.axis("off")
        ax.text(0.5, 0.5, table[i][j], ha="center", va="center", wrap=True, fontsize=10)

plt.tight_layout()
plt.show()
[Figure: classification reports and metrics for each cutoff (CatBoost)]
In [296]:
# joblib.dump(grid_search_cb, "grid_search_cb.pkl")

Thanks to CatBoost we managed to get some quite impressive results. The optimal cut-off seems to be 0.6, with an F1 of 48.4%, balanced accuracy of 82.6%, precision of 34% and recall of 83.5%. The main observation is that we finally managed to cut down the number of false negatives, i.e. cases in which the model predicts no stroke while a stroke actually occurred. The recall value is therefore substantially higher, which also improves the balanced accuracy. Precision, i.e. the share of positive predictions that are correct, is low, but in the context of this analysis that does not bother us much - predicting a stroke that ultimately does not occur is far less harmful than the opposite. Still, we would like to improve it.

LightGBM¶

As an article by L. Soyoung states, LightGBM was developed to address the computational inefficiencies of traditional Gradient Boosting Machines (GBM), which process all data instances across all features, leading to heavy computation. To mitigate this, LightGBM introduces two key techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).

GOSS focuses on sampling instances based on their gradients, prioritizing those with larger gradients for training. EFB optimizes memory usage by bundling exclusive features together, allowing for more efficient processing and faster training without compromising the performance significantly. Together, these techniques make LightGBM significantly faster and more scalable compared to traditional GBM, while maintaining high prediction accuracy.
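To make GOSS more concrete, here is a small, purely conceptual NumPy sketch of its sampling rule (keep the top a-fraction of instances by absolute gradient, randomly keep a b-fraction of the rest and up-weight it by (1 - a) / b); this illustrates the idea only and is not LightGBM's internal code:

In [ ]:
# conceptual GOSS-style sampling (illustration, not LightGBM internals)
def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))                        # sort by |gradient|, descending
    top = order[:int(a * n)]                                      # always keep large-gradient instances
    rest = order[int(a * n):]
    sampled = rng.choice(rest, size=int(b * n), replace=False)    # subsample small-gradient instances
    idx = np.concatenate([top, sampled])
    weights = np.ones(len(idx))
    weights[len(top):] = (1 - a) / b                              # re-weight to keep the gradient sum unbiased
    return idx, weights

idx, w = goss_sample(np.random.default_rng(1).normal(size=1000))
print(len(idx), "instances kept out of 1000")                     # 200 top + 100 sampled = 300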

Undoubtedly, we noticed the efficiency of LightGBM - we managed to run over 34k fits quite quickly with promising results, 34k being the biggest number of tested fits within this project. We optimized the following hyperparameters:

  • n_estimators: as previously; 100 trees were used, although bigger values were available.
  • max_depth: the maximum depth of the trees; just like in CatBoost, values up to 10 are used. It is worth noting that setting this parameter to -1 lets trees grow until every leaf contains fewer samples than min_child_samples.
  • learning_rate: as previously; 0.05 was chosen, so a value somewhat in the middle.
  • num_leaves: the maximum number of leaves per tree; it is worth noting that values are often set around 2^max_depth (or slightly below).
  • min_child_samples: as previously; 50 was chosen, so quite a high value. We know that we can now avoid too detailed splits, so we do not expect overfitting.
  • reg_alpha: L1 regularization applied to leaf weights to reduce overfitting, where a higher value penalizes large weights in the leaf nodes, making the model more conservative (so less prone to overfitting). The highest value possible (10) was chosen.
  • reg_lambda: the L2 regularization term applied to leaf weights, also to reduce overfitting. It is often paired with L1 to strengthen the effect. Again, a value of 10 was chosen.

~ Inspired by: https://medium.com/@soyoungluna/simple-explanation-of-lightgbm-without-complicated-mathematics-973998ec848f

In [300]:
'''
start_time = time.time()

pipe_lgbm = ImbPipeline([
    ('preprocessor', preprocessor),
    ('resampler', ADASYN()), 
    ('classifier', LGBMClassifier())
])

param_grid_lgbm = {
    'classifier__n_estimators': [100, 200, 500],
    'classifier__max_depth': [6, 8, 10, -1],  
    'classifier__learning_rate': [0.01, 0.05, 0.1, 0.2],
    'classifier__num_leaves': [31, 63, 127], 
    'classifier__min_child_samples': [10, 20, 50], 
    'classifier__reg_alpha': [0, 0.1, 1, 10], 
    'classifier__reg_lambda': [0, 0.1, 1, 10] 
}

grid_search_lgbm = GridSearchCV(
    estimator=pipe_lgbm,
    param_grid=param_grid_lgbm,
    scoring='roc_auc',
    cv=5,
    n_jobs=-1,
    verbose=2
)

grid_search_lgbm.fit(X_train, y_train.values.ravel())

# results
best_lgbm_model = grid_search_lgbm.best_estimator_
print("Best hyperparameters for LightGBM:", grid_search_lgbm.best_params_)
print("Best cross-validated AUC score for LightGBM:", round(grid_search_lgbm.best_score_, 3))
end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time:.2f} s")
'''
print('')

In [301]:
grid_search_lgbm = joblib.load("grid_search_lgbm.pkl")
In [302]:
# the best model and predictions
best_lgbm_model = grid_search_lgbm.best_estimator_
print("Best hyperparameters for LightGBM:", grid_search_lgbm.best_params_)
y_pred_lgbm = best_lgbm_model.predict(X_test)
y_pred_prob_lgbm = best_lgbm_model.predict_proba(X_test)[:, 1]
Best hyperparameters for LightGBM: {'classifier__learning_rate': 0.05, 'classifier__max_depth': 6, 'classifier__min_child_samples': 50, 'classifier__n_estimators': 100, 'classifier__num_leaves': 127, 'classifier__reg_alpha': 10, 'classifier__reg_lambda': 10}
In [303]:
# cutoff values
cutoffs = np.arange(0.5, 0.95, 0.05)

# a 3x3 canvas is desired
fig, axes = plt.subplots(3, 3, figsize=(15, 15))
fig.suptitle('Confusion Matrices for Different Cutoff Values (LightGBM)', fontsize=16)

# confusion matrices and reports for each cutoff
for idx, cutoff in enumerate(cutoffs):
    y_pred_prob_lgbm = best_lgbm_model.predict_proba(X_test)[:, 1]
    row, col = divmod(idx, 3)
    y_pred_cutoff = (y_pred_prob_lgbm >= cutoff).astype(int)
    cm = confusion_matrix(y_test, y_pred_cutoff)

    # plotting the confusion matrix
    sns.heatmap(cm, annot=True, fmt='d', cmap='Purples', ax=axes[row, col], 
                xticklabels=['No Stroke', 'Stroke'], yticklabels=['No Stroke', 'Stroke'])
    
    axes[row, col].set_title(f'Cutoff = {cutoff:.2f}')
    axes[row, col].set_xlabel('Predicted')
    axes[row, col].set_ylabel('Actual')

plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()
[Figure: confusion matrices for cutoffs 0.50-0.90 (LightGBM)]
In [304]:
# neat layout by ChatGPT
reports = []

for cutoff in cutoffs:
    y_pred_cutoff = (y_pred_prob_lgbm >= cutoff).astype(int)

    report = classification_report(y_test, y_pred_cutoff)

    f1 = f1_score(y_test, y_pred_cutoff)
    balanced_acc = balanced_accuracy_score(y_test, y_pred_cutoff)
    precision = precision_score(y_test, y_pred_cutoff)
    recall = recall_score(y_test, y_pred_cutoff)
    roc_auc = roc_auc_score(y_test, y_pred_prob_lgbm)

    full_report = f"Cutoff = {cutoff:.2f}\n\n{report}\n" \
                  f"F1 Score: {f1:.3f}\n" \
                  f"Balanced Accuracy: {balanced_acc:.3f}\n" \
                  f"Precision: {precision:.3f}\n" \
                  f"Recall: {recall:.3f}\n" \
                  f"AUC: {roc_auc:.3f}"
    reports.append(full_report)

rows, cols = 3, 3

while len(reports) < rows * cols:
    reports.append("")  

table = [reports[i:i + cols] for i in range(0, len(reports), cols)]

fig, axes = plt.subplots(rows, cols, figsize=(15, 15))
for i in range(rows):
    for j in range(cols):
        ax = axes[i, j]
        ax.axis("off")
        ax.text(0.5, 0.5, table[i][j], ha="center", va="center", wrap=True, fontsize=10)

plt.tight_layout()
plt.show()
[Figure: classification reports and metrics for each cutoff (LightGBM)]
In [305]:
# joblib.dump(grid_search_lgbm, "grid_search_lgbm.pkl")

The results for LightGBM are quite impressive. The selected hyperparameters include a lot of regularisation, which makes overfitting unlikely. The optimal cut-off seems to be 0.5, with an F1 of 45.9%, balanced accuracy of 79.9%, precision of 32.6% and recall of 78%. The main observation is that the picture is almost the same as for CatBoost, although each metric is about 2 p.p. lower. Both LightGBM and CatBoost therefore seem solid, with CatBoost producing slightly better results.

Multilayer Perceptron | 1st run¶

Multi-Layer Perceptrons are feedforward artificial neural networks consisting of an input layer, usually a few hidden layers, and an output layer. Each neuron in a layer is fully connected to the neurons in the subsequent layer, and the model learns by adjusting weights through backpropagation. This approach is exceptionally good at detecting non-linear relationships, but it requires far more resources than the models used before - the cost was so high that we were only able to run fewer than 1k fits, and even then we waited a long time for results. We focused on optimizing hyperparameters concerning the number of neurons within 5 hidden layers, the activation function, regularisation and so-called early stopping. More precisely, the hyperparameters in the grid are:

  • solver: the weight-optimisation algorithm; only adam was used here.
  • learning_rate_init: as previously; we set a very low value of 0.0001, and other values are tested in the later runs.
  • max_iter: the maximum number of iterations for training; we decided to choose 300, a bit more than default 200.
  • hidden_layer_sizes: larger and deeper layers increase the capacity of the model but may lead to overfitting and longer training times; (400, 400, 400, 400, 400), the largest architecture we proposed, was chosen.
  • activation: logistic - the basic choice, prone to vanishing gradients; tanh - better gradient flow than logistic, especially for deeper networks; relu - the most popular activation for deep learning, offering fast computation and mitigating vanishing gradients. In hindsight, we would probably set relu as the only option (logistic was chosen after validation) and differentiate the grid over other hyperparameters.
  • alpha: the L2 regularization term working as previously; 0.001, so the value in the middle, was chosen.
  • early_stopping: allows training to stop early if validation performance does not improve; False was chosen after validation, although in hindsight we would set it to True.

Since, in our opinion, there was some room for improvement in grid construction, we decided to try a few more MLP runs in the following subsections.
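For orientation, a stand-alone illustration (not the tuned model from the grid below): hidden_layer_sizes is simply a tuple with one entry per hidden layer, and early_stopping sets aside an internal validation split (10% by default) to halt training once the validation score stops improving:

In [ ]:
# illustrative MLP: three hidden layers of 128, 64 and 32 neurons, ReLU activations,
# Adam optimiser, L2 penalty alpha, early stopping on an internal validation split
toy_mlp = MLPClassifier(hidden_layer_sizes=(128, 64, 32), activation='relu',
                        solver='adam', alpha=0.001, early_stopping=True,
                        max_iter=300, random_state=0)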

In [309]:
'''
start_time = time.time()

pipe_mlp = ImbPipeline([
    ('preprocessor', preprocessor),
    ('resampler', ADASYN()),
    ('classifier', MLPClassifier()) 
])

param_grid_mlp = {
    'classifier__solver': ['adam'],
    'classifier__learning_rate_init': [0.0001],
    'classifier__max_iter': [300],
    'classifier__hidden_layer_sizes': [
        (500, 400, 300, 200, 100),
        (400, 400, 400, 400, 400),
        (300, 300, 300, 300, 300),
        (200, 200, 200, 200, 200)
    ],
    'classifier__activation': ['logistic', 'tanh', 'relu'],
    'classifier__alpha': [0.0001, 0.001, 0.005],
    'classifier__early_stopping': [True, False]
}

grid_search_mlp = GridSearchCV(
    estimator=pipe_mlp,
    param_grid=param_grid_mlp,
    scoring='roc_auc',
    cv=5,
    n_jobs=-1,
    verbose=2
)

grid_search_mlp.fit(X_train, y_train.values.ravel())

best_mlp_model = grid_search_mlp.best_estimator_
print("Best hyperparameters for MLPClassifier:", grid_search_mlp.best_params_)

print("Best cross-validated AUC score for MLPClassifier:", round(grid_search_mlp.best_score_, 3))

end_time = time.time()

execution_time = end_time - start_time
print(f"Execution time: {execution_time:.2f} s")
'''
print('')

In [310]:
grid_search_mlp = joblib.load("grid_search_mlp.pkl")
In [311]:
# the best model
best_mlp_model = grid_search_mlp.best_estimator_
print("Best hyperparameters for MLPClassifier:", grid_search_mlp.best_params_)

# predictions
y_pred_mlp = best_mlp_model.predict(X_test)
y_pred_prob_mlp = best_mlp_model.predict_proba(X_test)[:, 1]
Best hyperparameters for MLPClassifier: {'classifier__activation': 'logistic', 'classifier__alpha': 0.001, 'classifier__early_stopping': False, 'classifier__hidden_layer_sizes': (400, 400, 400, 400, 400), 'classifier__learning_rate_init': 0.0001, 'classifier__max_iter': 300, 'classifier__solver': 'adam'}
In [312]:
# cutoffs between 0.5 and 0.9 are to be tested
cutoffs = np.arange(0.5, 0.95, 0.05)

# I'd like a 3x3 canvas
fig, axes = plt.subplots(3, 3, figsize=(15, 15))
fig.suptitle('Confusion Matrices for Different Cutoff Values (MLP 1st run)', fontsize=16)

# confusion matrices and reports for each cutoff
for idx, cutoff in enumerate(cutoffs):
    y_pred_cutoff = (y_pred_prob_mlp >= cutoff).astype(int)
    row, col = divmod(idx, 3)
    cm = confusion_matrix(y_test, y_pred_cutoff)

    # cm plot
    sns.heatmap(cm, annot=True, fmt='d', cmap='Purples', ax=axes[row, col], 
                xticklabels=['No Stroke', 'Stroke'], yticklabels=['No Stroke', 'Stroke'])
    
    axes[row, col].set_title(f'Cutoff = {cutoff:.2f}')
    axes[row, col].set_xlabel('Predicted')
    axes[row, col].set_ylabel('Actual')
    
# layout adjustment
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()
[Figure: confusion matrices for cutoffs 0.50-0.90 (MLP, 1st run)]
In [313]:
# neat layout by ChatGPT
reports = []

for cutoff in cutoffs:
    y_pred_cutoff = (y_pred_prob_mlp >= cutoff).astype(int)

    report = classification_report(y_test, y_pred_cutoff)

    f1 = f1_score(y_test, y_pred_cutoff)
    balanced_acc = balanced_accuracy_score(y_test, y_pred_cutoff)
    precision = precision_score(y_test, y_pred_cutoff)
    recall = recall_score(y_test, y_pred_cutoff)
    roc_auc = roc_auc_score(y_test, y_pred_prob_mlp)

    full_report = f"Cutoff = {cutoff:.2f}\n\n{report}\n" \
                  f"F1 Score: {f1:.3f}\n" \
                  f"Balanced Accuracy: {balanced_acc:.3f}\n" \
                  f"Precision: {precision:.3f}\n" \
                  f"Recall: {recall:.3f}\n" \
                  f"AUC: {roc_auc:.3f}"
    reports.append(full_report)

rows, cols = 3, 3

while len(reports) < rows * cols:
    reports.append("")  

table = [reports[i:i + cols] for i in range(0, len(reports), cols)]

fig, axes = plt.subplots(rows, cols, figsize=(15, 15))
for i in range(rows):
    for j in range(cols):
        ax = axes[i, j]
        ax.axis("off")
        ax.text(0.5, 0.5, table[i][j], ha="center", va="center", wrap=True, fontsize=10)

plt.tight_layout()
plt.show()
[Figure: classification reports and metrics for each cutoff (MLP, 1st run)]

Even though not many fits were compared, the results are very good. The cut-off equal to 0.6 seems to be the best trade-off between the number of bad predictions for 0s and 1s. The F1 score equals 43.2%, balanced accuracy exceeds 82%, precision is slightly below 30% and recall is the highest across all models - around 90%.

Again, the precision value is subpar, but we pay more attention to recall: we largely avoid predicting no stroke when a stroke actually occurred. Balanced accuracy is also very impressive, similar to CatBoost. We believe that if more MLPs were trained, this method would give the best results overall.

Multilayer Perceptron | 2nd run¶

In [321]:
'''
param_grid_mlp = {
    'classifier__solver': ['adam'],
    'classifier__learning_rate_init': [0.001, 0.01, 0.05, 0.1],       <-- 3 more learning rates
    'classifier__max_iter': [300],
    'classifier__hidden_layer_sizes': [
        (500, 400, 300, 200, 100),
        (400, 400, 400, 400, 400),
        (300, 300, 300, 300, 300),
        (200, 200, 200, 200, 200)
    ],
    'classifier__activation': ['tanh', 'relu'],                       <-- sigmoid activation function excluded
    'classifier__alpha': [0.0001, 0.005, 0.01],
    'classifier__early_stopping': [True]
}
'''
print('')

In [322]:
grid_search_mlp = joblib.load("grid_search_mlp_2.pkl")
In [323]:
# the best model
best_mlp_model = grid_search_mlp.best_estimator_
print("Best hyperparameters for MLPClassifier:", grid_search_mlp.best_params_)

# predictions
y_pred_mlp = best_mlp_model.predict(X_test)
y_pred_prob_mlp = best_mlp_model.predict_proba(X_test)[:, 1]
Best hyperparameters for MLPClassifier: {'classifier__activation': 'tanh', 'classifier__alpha': 0.005, 'classifier__early_stopping': True, 'classifier__hidden_layer_sizes': (500, 400, 300, 200, 100), 'classifier__learning_rate_init': 0.05, 'classifier__max_iter': 300, 'classifier__solver': 'adam'}
In [324]:
# cutoffs between 0.5 and 0.9 are to be tested
cutoffs = np.arange(0.5, 0.95, 0.05)

# I'd like a 3x3 canvas
fig, axes = plt.subplots(3, 3, figsize=(15, 15))
fig.suptitle('Confusion Matrices for Different Cutoff Values (MLP 2nd run)', fontsize=16)

# confusion matrices and reports for each cutoff
for idx, cutoff in enumerate(cutoffs):
    y_pred_cutoff = (y_pred_prob_mlp >= cutoff).astype(int)
    row, col = divmod(idx, 3)
    cm = confusion_matrix(y_test, y_pred_cutoff)

    # cm plot
    sns.heatmap(cm, annot=True, fmt='d', cmap='Purples', ax=axes[row, col], 
                xticklabels=['No Stroke', 'Stroke'], yticklabels=['No Stroke', 'Stroke'])
    
    axes[row, col].set_title(f'Cutoff = {cutoff:.2f}')
    axes[row, col].set_xlabel('Predicted')
    axes[row, col].set_ylabel('Actual')
    
# layout adjustment
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()
[Figure: confusion matrices for cutoffs 0.50-0.90 (MLP, 2nd run)]
In [325]:
# neat layout by ChatGPT
reports = []

for cutoff in cutoffs:
    y_pred_cutoff = (y_pred_prob_mlp >= cutoff).astype(int)

    report = classification_report(y_test, y_pred_cutoff, zero_division=0) # zero_division to avoid warnings (as instructed in the warning)

    f1 = f1_score(y_test, y_pred_cutoff, zero_division=0)
    balanced_acc = balanced_accuracy_score(y_test, y_pred_cutoff)
    precision = precision_score(y_test, y_pred_cutoff, zero_division=0)
    recall = recall_score(y_test, y_pred_cutoff, zero_division=0)
    roc_auc = roc_auc_score(y_test, y_pred_prob_mlp)

    full_report = f"Cutoff = {cutoff:.2f}\n\n{report}\n" \
                  f"F1 Score: {f1:.3f}\n" \
                  f"Balanced Accuracy: {balanced_acc:.3f}\n" \
                  f"Precision: {precision:.3f}\n" \
                  f"Recall: {recall:.3f}\n" \
                  f"AUC: {roc_auc:.3f}"
    reports.append(full_report)

rows, cols = 3, 3

while len(reports) < rows * cols:
    reports.append("")  

table = [reports[i:i + cols] for i in range(0, len(reports), cols)]

fig, axes = plt.subplots(rows, cols, figsize=(15, 15))
for i in range(rows):
    for j in range(cols):
        ax = axes[i, j]
        ax.axis("off")
        ax.text(0.5, 0.5, table[i][j], ha="center", va="center", wrap=True, fontsize=10)

plt.tight_layout()
plt.show()
[Figure: per-cutoff metric reports (MLP, 2nd run)]

Multilayer Perceptron | 3rd run¶

In [327]:
'''
param_grid_mlp = {
    'classifier__solver': ['adam'],
    'classifier__learning_rate_init': [0.001, 0.01, 0.05, 0.1],
    'classifier__max_iter': [300],
    'classifier__hidden_layer_sizes': [
        (512, 256, 128),         # less and less neurons in subsequent layers
        (256, 256, 256),         # symmetrical distribution of neurons
        (128, 256, 128),         # peak in the middle
        (128, 64, 32, 16),       # small architecture in terms of total number of neurons
        (1024, 512),             # only 2 layers but big ones
    ],
    'classifier__activation': ['tanh', 'relu'],
    'classifier__alpha': [0.0001, 0.001, 0.01],
    'classifier__early_stopping': [True, False]
}
'''
print('')

In [328]:
grid_search_mlp = joblib.load("grid_search_mlp_3.pkl")
In [329]:
# the best model
best_mlp_model = grid_search_mlp.best_estimator_
print("Best hyperparameters for MLPClassifier:", grid_search_mlp.best_params_)

# predictions
y_pred_mlp = best_mlp_model.predict(X_test)
y_pred_prob_mlp = best_mlp_model.predict_proba(X_test)[:, 1]
Best hyperparameters for MLPClassifier: {'classifier__activation': 'tanh', 'classifier__alpha': 0.005, 'classifier__early_stopping': True, 'classifier__hidden_layer_sizes': (256, 256, 256), 'classifier__learning_rate_init': 0.05, 'classifier__max_iter': 300, 'classifier__solver': 'adam'}
In [330]:
# cutoffs between 0.5 and 0.9 are to be tested
cutoffs = np.arange(0.5, 0.95, 0.05)

# I'd like a 3x3 canvas
fig, axes = plt.subplots(3, 3, figsize=(15, 15))
fig.suptitle('Confusion Matrices for Different Cutoff Values (MLP)', fontsize=16)

# confusion matrices and reports for each cutoff
for idx, cutoff in enumerate(cutoffs):
    y_pred_cutoff = (y_pred_prob_mlp >= cutoff).astype(int)
    row, col = divmod(idx, 3)
    cm = confusion_matrix(y_test, y_pred_cutoff)

    # cm plot
    sns.heatmap(cm, annot=True, fmt='d', cmap='Purples', ax=axes[row, col], 
                xticklabels=['No Stroke', 'Stroke'], yticklabels=['No Stroke', 'Stroke'])
    
    axes[row, col].set_title(f'Cutoff = {cutoff:.2f}')
    axes[row, col].set_xlabel('Predicted')
    axes[row, col].set_ylabel('Actual')
    
# layout adjustment
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()
[Figure: Confusion Matrices for Different Cutoff Values (MLP 3rd run)]
In [331]:
# neat layout by ChatGPT
reports = []

for cutoff in cutoffs:
    y_pred_cutoff = (y_pred_prob_mlp >= cutoff).astype(int)

    report = classification_report(y_test, y_pred_cutoff)

    f1 = f1_score(y_test, y_pred_cutoff)
    balanced_acc = balanced_accuracy_score(y_test, y_pred_cutoff)
    precision = precision_score(y_test, y_pred_cutoff)
    recall = recall_score(y_test, y_pred_cutoff)
    roc_auc = roc_auc_score(y_test, y_pred_prob_mlp)

    full_report = f"Cutoff = {cutoff:.2f}\n\n{report}\n" \
                  f"F1 Score: {f1:.3f}\n" \
                  f"Balanced Accuracy: {balanced_acc:.3f}\n" \
                  f"Precision: {precision:.3f}\n" \
                  f"Recall: {recall:.3f}\n" \
                  f"AUC: {roc_auc:.3f}"
    reports.append(full_report)

rows, cols = 3, 3

while len(reports) < rows * cols:
    reports.append("")  

table = [reports[i:i + cols] for i in range(0, len(reports), cols)]

fig, axes = plt.subplots(rows, cols, figsize=(15, 15))
for i in range(rows):
    for j in range(cols):
        ax = axes[i, j]
        ax.axis("off")
        ax.text(0.5, 0.5, table[i][j], ha="center", va="center", wrap=True, fontsize=10)

plt.tight_layout()
plt.show()
[Figure: per-cutoff metric reports (MLP, 3rd run)]
In [332]:
# joblib.dump(grid_search_mlp, "grid_search_mlp.pkl")

Summary¶

Below is a summary of the best model from each category.

In [336]:
summary = {
    "Model (cut-off)": ["Random Forest (0.5)", "XGBoost (0.55)", "CatBoost (0.6)", "LightGBM (0.5)", "MLP 1st run (0.6)", "MLP 2nd run (0.7)", "MLP 3rd run (0.5)"],
    "F1": [0.411, 0.432, 0.484, 0.459, 0.432, 0.479, 0.456],
    "Balanced Accuracy": [0.675, 0.673, 0.826, 0.799, 0.821, 0.842, 0.811],
    "Precision": [0.400, 0.478, 0.341, 0.326, 0.284, 0.328, 0.315],
    "Recall": [0.422, 0.394, 0.835, 0.780, 0.899, 0.890, 0.826]
}

summary_df = pd.DataFrame(summary)

# Display the dataframe
summary_df
Out[336]:
Model (cut-off) F1 Balanced Accuracy Precision Recall
0 Random Forest (0.5) 0.411 0.675 0.400 0.422
1 XGBoost (0.55) 0.432 0.673 0.478 0.394
2 CatBoost (0.6) 0.484 0.826 0.341 0.835
3 LightGBM (0.5) 0.459 0.799 0.326 0.780
4 MLP 1st run (0.6) 0.432 0.821 0.284 0.899
5 MLP 2nd run (0.7) 0.479 0.842 0.328 0.890
6 MLP 3rd run (0.5) 0.456 0.811 0.315 0.826
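As a small convenience (not part of the original notebook), the table can also be ranked programmatically. The sketch below assumes the summary_df defined above; the choice of sorting keys is ours.

# hedged convenience: rank the summary table and list the per-metric winners
ranked = summary_df.sort_values(by=["Recall", "Balanced Accuracy"], ascending=False)
print(ranked.to_string(index=False))

for metric in ["F1", "Balanced Accuracy", "Precision", "Recall"]:
    best_row = summary_df.loc[summary_df[metric].idxmax()]
    print(f"Best {metric}: {best_row['Model (cut-off)']} ({best_row[metric]:.3f})")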

Let's load the final model once again.

In [338]:
grid_search_mlp = joblib.load("grid_search_mlp_2.pkl")
best_mlp_model = grid_search_mlp.best_estimator_
y_pred_mlp = best_mlp_model.predict(X_test)
y_pred_prob_mlp = best_mlp_model.predict_proba(X_test)[:, 1]
In [339]:
cutoff = 0.7

y_pred_cutoff = (y_pred_prob_mlp >= cutoff).astype(int)
cm = confusion_matrix(y_test, y_pred_cutoff)

report = classification_report(y_test, y_pred_cutoff)
f1 = f1_score(y_test, y_pred_cutoff)
balanced_acc = balanced_accuracy_score(y_test, y_pred_cutoff)
precision = precision_score(y_test, y_pred_cutoff)
recall = recall_score(y_test, y_pred_cutoff)
roc_auc = roc_auc_score(y_test, y_pred_prob_mlp)

full_report = f"Classification Report (Cutoff = {cutoff:.2f})\n\n{report}\n" \
              f"F1 Score: {f1:.3f}\n" \
              f"Balanced Accuracy: {balanced_acc:.3f}\n" \
              f"Precision: {precision:.3f}\n" \
              f"Recall: {recall:.3f}\n" \
              f"AUC: {roc_auc:.3f}"

fig, axes = plt.subplots(1, 2, figsize=(20, 8))
fig.suptitle("The best MLP, cut-off = 0.7", fontsize=18)
sns.heatmap(cm, annot=True, fmt='d', cmap='Purples', ax=axes[0], 
            xticklabels=['No Stroke', 'Stroke'], yticklabels=['No Stroke', 'Stroke'])
axes[0].set_title("Confusion Matrix", fontsize=14)
axes[0].set_xlabel("Predicted", fontsize=12)
axes[0].set_ylabel("Actual", fontsize=12)

axes[1].axis("off")
axes[1].text(0.5, 0.5, full_report, ha="center", va="center", wrap=True, fontsize=12)
axes[1].set_title("Metrics Report", fontsize=14)

plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()
print("Best hyperparameters for MLPClassifier:", grid_search_mlp.best_params_)
[Figure: The best MLP, cut-off = 0.7: confusion matrix and metrics report]
Best hyperparameters for MLPClassifier: {'classifier__activation': 'tanh', 'classifier__alpha': 0.005, 'classifier__early_stopping': True, 'classifier__hidden_layer_sizes': (500, 400, 300, 200, 100), 'classifier__learning_rate_init': 0.05, 'classifier__max_iter': 300, 'classifier__solver': 'adam'}

The best of all tested models is the Multilayer Perceptron from the 2nd grid specification presented in the subsection "Multilayer Perceptron | 2nd run"; its hyperparameters are shown above. Our decision rests on two main factors. First, we focused on recall (89%): since this is a medical data set, we want to avoid false negatives, and the final model leaves only 12 of them. Second, balanced accuracy is preferred over plain accuracy when classes are imbalanced, as they are here, and this model exceeds 84%, which is satisfactory. Both values are the highest across all models we built, and the remaining metrics are at least acceptable.
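To make the reasoning behind balanced accuracy explicit, the short sketch below recomputes it by hand from the confusion matrix cm of the final model at the 0.7 cut-off (an illustration only, reusing the objects defined above): balanced accuracy is the mean of sensitivity and specificity, so a high value means the model does well on both the stroke and no-stroke classes despite the imbalance.

# hedged illustration: recompute balanced accuracy by hand from the confusion matrix above
tn, fp, fn, tp = cm.ravel()

sensitivity = tp / (tp + fn)   # recall on the stroke class
specificity = tn / (tn + fp)   # recall on the no-stroke class
balanced_acc_manual = (sensitivity + specificity) / 2

print(f"Sensitivity (recall): {sensitivity:.3f}")
print(f"Specificity:          {specificity:.3f}")
print(f"Balanced accuracy:    {balanced_acc_manual:.3f}")
print(f"False negatives left: {fn}")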

Additional overfitting check¶

In [344]:
pipe = ImbPipeline([
    ('preprocessor', preprocessor),
    ('resampler', ADASYN()), # the data needs resampling (about 95% of 0s)
    ('classifier', MLPClassifier()) # classifier refit with the hyperparameters chosen above
])

# grid
param_grid_mlp = {
    'classifier__activation': ['tanh'],
    'classifier__alpha': [0.005],
    'classifier__early_stopping': [True],
    'classifier__hidden_layer_sizes': [(500, 400, 300, 200, 100)],
    'classifier__learning_rate_init': [0.05],
    'classifier__max_iter': [300],
    'classifier__solver': ['adam']
}
# 3-fold cross-validation over a single candidate, just to refit the chosen configuration
grid_search_mlp = GridSearchCV(pipe, param_grid_mlp, cv=3, verbose=2, n_jobs=-1)
grid_search_mlp.fit(X_train, y_train)
print('')
Fitting 3 folds for each of 1 candidates, totalling 3 fits
E:\anaconda\Lib\site-packages\sklearn\utils\validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
E:\anaconda\Lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:1098: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

In [345]:
# the best model
best_mlp_model = grid_search_mlp.best_estimator_

# predictions for the train part
y_pred_mlp = best_mlp_model.predict(X_train)
y_pred_prob_mlp = best_mlp_model.predict_proba(X_train)[:, 1]
In [346]:
# cutoffs between 0.5 and 0.9 are to be tested
cutoffs = np.arange(0.5, 0.95, 0.05)

# I'd like a 3x3 canvas
fig, axes = plt.subplots(3, 3, figsize=(15, 15))
fig.suptitle('Confusion Matrices for Different Cutoff Values (MLP)', fontsize=16)

# confusion matrices and reports for each cutoff
for idx, cutoff in enumerate(cutoffs):
    y_pred_cutoff = (y_pred_prob_mlp >= cutoff).astype(int)
    row, col = divmod(idx, 3)
    cm = confusion_matrix(y_train, y_pred_cutoff)

    # cm plot
    sns.heatmap(cm, annot=True, fmt='d', cmap='Purples', ax=axes[row, col], 
                xticklabels=['No Stroke', 'Stroke'], yticklabels=['No Stroke', 'Stroke'])
    
    axes[row, col].set_title(f'Cutoff = {cutoff:.2f}')
    axes[row, col].set_xlabel('Predicted')
    axes[row, col].set_ylabel('Actual')
    
# layout adjustment
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()
[Figure: Confusion Matrices for Different Cutoff Values (MLP, train set)]
In [347]:
# neat layout by ChatGPT
reports = []

for cutoff in cutoffs:
    y_pred_cutoff = (y_pred_prob_mlp >= cutoff).astype(int)

    report = classification_report(y_train, y_pred_cutoff)

    f1 = f1_score(y_train, y_pred_cutoff)
    balanced_acc = balanced_accuracy_score(y_train, y_pred_cutoff)
    precision = precision_score(y_train, y_pred_cutoff)
    recall = recall_score(y_train, y_pred_cutoff)
    roc_auc = roc_auc_score(y_train, y_pred_prob_mlp)

    full_report = f"Cutoff = {cutoff:.2f}\n\n{report}\n" \
                  f"F1 Score: {f1:.3f}\n" \
                  f"Balanced Accuracy: {balanced_acc:.3f}\n" \
                  f"Precision: {precision:.3f}\n" \
                  f"Recall: {recall:.3f}\n" \
                  f"AUC: {roc_auc:.3f}"
    reports.append(full_report)

rows, cols = 3, 3

while len(reports) < rows * cols:
    reports.append("")  

table = [reports[i:i + cols] for i in range(0, len(reports), cols)]

fig, axes = plt.subplots(rows, cols, figsize=(15, 15))
for i in range(rows):
    for j in range(cols):
        ax = axes[i, j]
        ax.axis("off")
        ax.text(0.5, 0.5, table[i][j], ha="center", va="center", wrap=True, fontsize=10)

plt.tight_layout()
plt.show()
[Figure: per-cutoff metric reports (MLP, train set)]

It turns out that there is no overfitting. The differences in results (slightly worse metrics than before) are most likely due to the sample size. The model seems to follow the same pattern on both the train and test samples.
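To put that claim in numbers, the same metrics can be compared side by side on both splits. The sketch below is illustrative only; it assumes best_mlp_model, X_train/y_train and X_test/y_test as defined above and evaluates at the chosen 0.7 cut-off.

# hedged sketch: the same metrics on both splits at the chosen 0.7 cut-off
cutoff = 0.7
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    prob = best_mlp_model.predict_proba(X_part)[:, 1]
    pred = (prob >= cutoff).astype(int)
    print(f"{name:>5} | balanced acc = {balanced_accuracy_score(y_part, pred):.3f}"
          f" | recall = {recall_score(y_part, pred, zero_division=0):.3f}"
          f" | AUC = {roc_auc_score(y_part, prob):.3f}")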