What causes heart failure, and how can it be predicted? These questions can be explored using this dataset on Kaggle by applying exploratory statistics as well as machine learning algorithms.
We start the analysis by describing the dataset:
The dataset contains 299 observations of 13 variables. The target variable to be predicted is DEATH_EVENT.
Let’s take a closer look at some of the distributions in the data using boxplots. We split the data by DEATH_EVENT: TRUE (1) or FALSE (0).
It seems that patients in this dataset with DEATH_EVENT = 1 are slightly older and have a lower ejection fraction (the percentage of blood leaving the heart at each contraction) and lower serum sodium.
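The comparison behind the boxplots can also be expressed as a per-group summary. The following is a minimal sketch, assuming the data is in a pandas DataFrame with columns named age, ejection_fraction and serum_sodium next to DEATH_EVENT; these column names, the helper median_by_outcome, and the toy rows are assumptions for illustration and are not values from the actual dataset.

```python
import pandas as pd


def median_by_outcome(df, columns, target="DEATH_EVENT"):
    """Return the median of each column, one row per value of the target."""
    return df.groupby(target)[columns].median()


# Made-up toy rows for illustration only, NOT taken from the real dataset.
toy = pd.DataFrame({
    "age": [50, 60, 70, 80],
    "ejection_fraction": [40, 38, 25, 30],
    "serum_sodium": [138, 137, 134, 135],
    "DEATH_EVENT": [0, 0, 1, 1],
})

summary = median_by_outcome(toy, ["age", "ejection_fraction", "serum_sodium"])
print(summary)
```

On the real data, the same call would show side by side the medians that the boxplots display as the center line of each box.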
Looking at the distributions below, age appears to be distributed almost identically for men and women in the dataset. The time variable (the follow-up period) also seems to be correlated with DEATH_EVENT.
Welch's t-test supports the similar age distribution for men and women: it finds no significant difference in mean age between the two groups. It also suggests that smoking might have an impact on the ejection fraction.
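For readers unfamiliar with the test: Welch's t-test compares the means of two groups without assuming equal variances. The sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom with the Python standard library only. The helper welch_t and the sample ages are made up for illustration; they are not the values or the code used in the analysis.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    # Squared standard error of each group mean (sample variance / group size).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up ages for illustration only, NOT taken from the real dataset.
age_men = [55, 60, 65, 70, 62]
age_women = [54, 61, 66, 69, 60]

t_stat, dof = welch_t(age_men, age_women)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof:.2f}")
```

A t statistic close to zero means the two group means are close relative to their spread. In practice, scipy.stats.ttest_ind(a, b, equal_var=False) performs the same test and also returns a p-value.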
Checking the data for possible multicollinearity, we see what seems to be a rather strong relationship between smoking and sex.
Taking a closer look at this relationship, we see that of the 105 smokers, 101 are male and only 4 are female.
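Counts like these come from a contingency table of the two variables. Here is a minimal sketch, assuming binary columns named sex and smoking, with sex = 1 meaning male; the column names, the encoding, the helper smoking_by_sex, and the toy rows are all assumptions for illustration.

```python
import pandas as pd


def smoking_by_sex(df):
    """Cross-tabulate sex (rows) against smoking (columns)."""
    return pd.crosstab(df["sex"], df["smoking"])


# Made-up toy rows for illustration only, NOT taken from the real dataset.
toy = pd.DataFrame({
    "sex": [1, 1, 1, 0, 0, 1],
    "smoking": [1, 1, 0, 0, 1, 0],
})

table = smoking_by_sex(toy)
print(table)
```

Reading down the smoking = 1 column of such a table gives the number of smokers per sex, which is how a strongly one-sided split like the one above becomes visible.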
To predict future cases, we use logistic regression and achieve a high accuracy of roughly 86%.
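The text does not state which features, preprocessing, or train/test split produced this figure, so the following is only a sketch of the general workflow with scikit-learn, not the model from the analysis. It uses synthetic data of the same shape (299 rows, 12 features) as a stand-in; the scaling step, the 80/20 split, and the random seeds are assumptions, and the printed accuracy is unrelated to the 86% reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 299 rows, 12 features, binary target.
X, y = make_classification(n_samples=299, n_features=12, random_state=0)

# Hold out 20% of the rows so accuracy is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Standardize the features, then fit a logistic regression classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Measuring accuracy on held-out rows rather than on the training data is what makes the number meaningful as an estimate for future cases.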
The source code of the model and more details can be reviewed here on Kaggle.