This repository contains an end-to-end data science pipeline for predicting heart attack risk with machine learning, built for the course 4IZ565/4IZ566 – Programming for Data Science in Python. The project includes:
- Data preparation and feature engineering
- Exploratory data analysis and feature selection
- Model training, hyperparameter tuning, and evaluation
Repository files:
- `heart_attack_prediction_dataset.csv`: Raw dataset with medical and demographic data
- `heart_attack_feature_engineered.csv`: Cleaned and enriched dataset after preprocessing
- `data_preparation.ipynb`: Data cleaning and feature engineering notebook
- `exploratory_analysis_and_feature_selection.ipynb`: EDA, visualizations, and feature selection notebook
- `model_training_and_evaluation.ipynb`: Model training, hyperparameter tuning, and evaluation notebook
- Visualizations:
  - `correlation_heatmap.png`: Heatmap of feature correlations
  - `feature_histograms_continuous.png`: Histograms of continuous features
  - `feature_binary_counts.png`: Count plots of binary features
  - `rf_feature_importance.png`: Feature importance from Random Forest
  - `heart_attack_roc_curve_v2.png`: ROC curve comparing model performance
The dataset comes from https://www.kaggle.com/datasets/iamsouravbanerjee/heart-attack-prediction-dataset and contains medical and demographic data for 8,786 patients. The goal is to identify factors that influence heart attack risk and build predictive models.
Key features:
- Blood Pressure (Systolic/Diastolic)
- BMI and BMI Category
- Cholesterol and Triglycerides
- Pulse Pressure and Heart Rate
- Lifestyle (Exercise, Sedentary hours, Diet)
- Demographics (Age, Age Group, Sex, Continent, Income)
Target variable:
Heart Attack Risk: 0 = No Risk, 1 = High Risk
Performed in data_preparation.ipynb using pandas:
- Converted the `Blood Pressure` string into numeric `Systolic BP` and `Diastolic BP`
- Created new features:
  - `Cardio Risk Score`: A composite health risk index calculated using a weighted formula based on five key factors known to influence cardiovascular risk:
    - `Age`
    - `Sex` (male = higher risk)
    - `Systolic BP` (systolic blood pressure)
    - `Cholesterol` (total cholesterol level)
    - (Originally included Smoking, but omitted if not available in dataset)

    The formula used:

    Cardio Risk Score = 0.03 × Age + 1.0 × (1 if Male else 0) + 0.02 × Systolic BP + 0.01 × Cholesterol
  - `Pulse Pressure` = Systolic - Diastolic
  - `BMI Category` and `Age Group`
- Added geographic features:
  - Patient count per continent
- Saved the output as `heart_attack_feature_engineered.csv`
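The preparation steps above can be sketched in pandas as follows. This is a minimal illustration, not the notebook's actual code: the two-row sample frame, the `"Male"`/`"Female"` encoding of `Sex`, and the `Continent Patient Count` column name are assumptions made for the example. The weights are the ones quoted in the formula.

```python
import pandas as pd

# Tiny hypothetical sample; the real notebook reads heart_attack_prediction_dataset.csv.
df = pd.DataFrame({
    "Age": [50, 40],
    "Sex": ["Male", "Female"],          # assumed encoding of the Sex column
    "Cholesterol": [200, 180],
    "Blood Pressure": ["120/80", "140/90"],
    "Continent": ["Europe", "Europe"],
})

# Split the "systolic/diastolic" string into two numeric columns.
bp = df["Blood Pressure"].str.split("/", expand=True).astype(int)
df["Systolic BP"] = bp[0]
df["Diastolic BP"] = bp[1]

# Pulse Pressure = Systolic - Diastolic
df["Pulse Pressure"] = df["Systolic BP"] - df["Diastolic BP"]

# Cardio Risk Score with the weights from the formula (Smoking omitted).
is_male = (df["Sex"] == "Male").astype(int)
df["Cardio Risk Score"] = (
    0.03 * df["Age"]
    + 1.0 * is_male
    + 0.02 * df["Systolic BP"]
    + 0.01 * df["Cholesterol"]
)

# Geographic feature: number of patients on each row's continent.
# The output column name is hypothetical.
df["Continent Patient Count"] = df["Continent"].map(df["Continent"].value_counts())

print(df[["Systolic BP", "Diastolic BP", "Pulse Pressure", "Cardio Risk Score"]])
```

For the first sample row the score works out to 0.03 × 50 + 1.0 + 0.02 × 120 + 0.01 × 200 = 6.9.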
Performed in exploratory_analysis_and_feature_selection.ipynb
- Separated features into categorical, binary, continuous, and numerical groups
- Visualizations:
  - `correlation_heatmap.png`: Correlation heatmap for numerical features
  - `feature_histograms_continuous.png`: Histograms for continuous variables
  - `feature_binary_counts.png`: Count plots for binary variables
- Feature selection:
  - RFE (Recursive Feature Elimination) using Logistic Regression
  - Feature importances using Random Forest (`rf_feature_importance.png`)
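As a rough sketch of the two feature selection techniques named above, the following runs RFE with Logistic Regression and ranks features by Random Forest importance. It uses synthetic data from `make_classification` rather than the project dataset, and the number of features to keep (4) and the forest size are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered feature matrix.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

# Logistic Regression coefficients are scale-sensitive, so scale before RFE.
X_scaled = StandardScaler().fit_transform(X)

# RFE repeatedly drops the weakest feature until 4 remain (4 is an assumption).
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X_scaled, y)
selected = [i for i, keep in enumerate(rfe.support_) if keep]
print("RFE kept feature indices:", selected)

# Random Forest importances give an independent, tree-based ranking.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)
ranked = sorted(
    enumerate(rf.feature_importances_), key=lambda pair: pair[1], reverse=True
)
for index, importance in ranked:
    print(f"feature {index}: {importance:.3f}")
```

Comparing the two rankings is a cheap consistency check: features that both methods rate highly are safer candidates to keep.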
Performed in model_training_and_evaluation.ipynb
- Preprocessing:
- Scaling numerical features
- One-hot encoding categorical variables
- Classifiers:
- Logistic Regression
- Random Forest
- Support Vector Classifier (SVC)
- Data split: 80/20 train/test
- Hyperparameter tuning with `GridSearchCV`
- Model selection based on test and cross-validation scores
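The training setup listed above could look roughly like the sketch below: a `ColumnTransformer` scales numeric columns and one-hot encodes categorical ones, the result feeds a classifier inside a `Pipeline`, and `GridSearchCV` tunes it on an 80/20 split. The data is synthetic, and the column names, the `Diet` categories, and the `C` grid are assumptions for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic numeric features plus one hypothetical categorical column.
X_num, y = make_classification(
    n_samples=200, n_features=3, n_informative=2, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_num, columns=["Age", "Cholesterol", "Systolic BP"])
X["Diet"] = ["Healthy", "Average", "Unhealthy", "Average"] * 50

# 80/20 train/test split, stratified on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale numeric columns and one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "Cholesterol", "Systolic BP"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Diet"]),
])

# Keeping preprocessing inside the pipeline means it is refit on each CV fold.
pipe = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The parameter grid is an assumption; "clf__C" targets the pipeline step "clf".
grid = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=5,
                    scoring="accuracy")
grid.fit(X_train, y_train)

test_accuracy = grid.score(X_test, y_test)
print("best params:", grid.best_params_)
print(f"CV accuracy: {grid.best_score_:.4f}, test accuracy: {test_accuracy:.4f}")
```

Swapping `LogisticRegression` for `RandomForestClassifier` or `SVC` only requires changing the `"clf"` step and the matching parameter grid.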
Resulting scores per model:
- Logistic Regression: 0.6418
- Random Forest: 0.6414
- SVC: 0.6418
ROC curve:
- Saved as `heart_attack_roc_curve_v2.png`
- Displays AUC scores and performance comparison
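A ROC comparison of this kind can be computed along the following lines. The sketch uses synthetic data and default model settings (both assumptions), and it stops at the curve points and AUC values. Plotting each `(fpr, tpr)` pair with matplotlib and saving the figure would produce an image like the one in this repository.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the project data.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVC": SVC(),
}

curves = {}
auc_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # ROC needs a continuous score: class-1 probability where available,
    # otherwise the signed distance from the decision boundary (SVC default).
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X_test)[:, 1]
    else:
        scores = model.decision_function(X_test)
    fpr, tpr, _ = roc_curve(y_test, scores)
    curves[name] = (fpr, tpr)
    auc_scores[name] = roc_auc_score(y_test, scores)
    print(f"{name}: AUC = {auc_scores[name]:.4f}")
```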
Cross-validation:
- 5-fold CV for each model
- Average accuracy printed for all models
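A minimal sketch of 5-fold cross-validation across the three classifier types, again on synthetic data and with default hyperparameters as assumptions. Scaling is wrapped in a pipeline for the two scale-sensitive models so that it is refit inside every fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the project data.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVC": make_pipeline(StandardScaler(), SVC()),
}

mean_accuracy = {}
for name, model in models.items():
    # cv=5 gives five stratified folds for a classifier; one accuracy per fold.
    fold_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    mean_accuracy[name] = fold_scores.mean()
    print(f"{name}: mean accuracy = {mean_accuracy[name]:.4f}")
```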
Design notes:
- Feature selection aided by Random Forest and RFE
- Three different classifier types used for diversity
- Pipelines keep preprocessing inside each training fold, which helps prevent data leakage and makes runs reproducible
To run this project locally:
- Clone this repository or download the files
- Make sure `heart_attack_prediction_dataset.csv` is in your working directory
- Open notebooks in this order:
  - `data_preparation.ipynb`
  - `exploratory_analysis_and_feature_selection.ipynb`
  - `model_training_and_evaluation.ipynb`
- Run each notebook cell-by-cell to reproduce results
Make sure the following Python libraries are installed in your environment:
- `pandas`
- `matplotlib`
- `seaborn`
- `scikit-learn`
You can install them via:
pip install pandas matplotlib seaborn scikit-learn