Semester project in Python for the course Programming in Python for Data Science

Bebicek/4IZ565-Python_dataming

Heart Attack Risk Prediction

Project Structure

This repository contains a complete data science pipeline that predicts heart attack risk with machine learning. It was created for the course 4IZ565/4IZ566 – Programming for Data Science in Python. The project includes:

  • Data preparation and feature engineering
  • Exploratory data analysis and feature selection
  • Model training, hyperparameter tuning, and evaluation

Files

  • heart_attack_prediction_dataset.csv: Raw dataset with medical and demographic data
  • heart_attack_feature_engineered.csv: Cleaned and enriched dataset after preprocessing
  • data_preparation.ipynb: Data cleaning and feature engineering notebook
  • exploratory_analysis_and_feature_selection.ipynb: EDA, visualizations, and feature selection notebook
  • model_training_and_evaluation.ipynb: Model training, hyperparameter tuning, and evaluation notebook
  • Visualizations:
    • correlation_heatmap.png: Heatmap of feature correlations
    • feature_histograms_continuous.png: Histograms of continuous features
    • feature_binary_counts.png: Count plots of binary features
    • rf_feature_importance.png: Feature importance from Random Forest
    • heart_attack_roc_curve_v2.png: ROC curve comparing model performance

Description of the Dataset

The dataset comes from https://www.kaggle.com/datasets/iamsouravbanerjee/heart-attack-prediction-dataset and contains medical and demographic data for 8,786 patients. The goal is to identify factors that influence heart attack risk and build predictive models.

Features include:

  • Blood Pressure (Systolic/Diastolic)
  • BMI and BMI Category
  • Cholesterol and Triglycerides
  • Pulse Pressure and Heart Rate
  • Lifestyle (Exercise, Sedentary hours, Diet)
  • Demographics (Age, Age Group, Sex, Continent, Income)

Target variable:

  • Heart Attack Risk: 0 = No Risk, 1 = High Risk

Data Preprocessing

Performed in data_preparation.ipynb using pandas:

Key steps:

  • Converted Blood Pressure string into numeric Systolic BP and Diastolic BP
  • Created new features:
    • Cardio Risk Score: A composite health risk index calculated as a weighted sum of key factors known to influence cardiovascular risk (four appear in the formula below):

      • Age
      • Sex (male = higher risk)
      • Systolic BP (systolic blood pressure)
      • Cholesterol (total cholesterol level)
      • (Smoking was originally intended as a fifth factor, but is omitted if not available in the dataset)

      The formula used:

      Cardio Risk Score = 
          0.03 × Age +
          1.0 × (1 if Male else 0) +
          0.02 × Systolic BP +
          0.01 × Cholesterol
      
    • Pulse Pressure = Systolic - Diastolic

    • BMI Category and Age Group

  • Added geographic features: Patient count per continent
  • Saved the output as heart_attack_feature_engineered.csv
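
The feature engineering steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the sample rows are made up, the column names are assumed to match the raw CSV, and the BMI and age bin edges, their labels, and the `Continent Patient Count` column name are assumptions chosen for the example.

```python
import pandas as pd

# Made-up sample rows; column names are assumed to match the raw dataset.
df = pd.DataFrame({
    "Age": [45, 62, 30],
    "Sex": ["Male", "Female", "Male"],
    "Blood Pressure": ["158/88", "120/80", "135/95"],
    "Cholesterol": [208, 250, 180],
    "BMI": [31.2, 24.5, 17.9],
    "Continent": ["Europe", "Asia", "Europe"],
})

# Split the "systolic/diastolic" string into two integer columns.
bp = df["Blood Pressure"].str.split("/", expand=True).astype(int)
df["Systolic BP"] = bp[0]
df["Diastolic BP"] = bp[1]

# Pulse Pressure = Systolic - Diastolic
df["Pulse Pressure"] = df["Systolic BP"] - df["Diastolic BP"]

# Cardio Risk Score, using the weights from the formula above.
df["Cardio Risk Score"] = (
    0.03 * df["Age"]
    + 1.0 * (df["Sex"] == "Male").astype(int)
    + 0.02 * df["Systolic BP"]
    + 0.01 * df["Cholesterol"]
)

# Assumed bin edges and labels (illustrative only).
df["BMI Category"] = pd.cut(
    df["BMI"],
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=["Underweight", "Normal", "Overweight", "Obese"],
    right=False,  # intervals are closed on the left: [18.5, 25) etc.
)
df["Age Group"] = pd.cut(
    df["Age"],
    bins=[0, 40, 60, 120],
    labels=["Young", "Middle-aged", "Senior"],
    right=False,
)

# Geographic feature: number of patients sharing each row's continent.
# The new column name is hypothetical.
df["Continent Patient Count"] = df["Continent"].map(df["Continent"].value_counts())
```

In the project itself, the resulting frame would then be written out with `df.to_csv("heart_attack_feature_engineered.csv", index=False)`.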

Exploratory Data Analysis & Feature Selection

Performed in exploratory_analysis_and_feature_selection.ipynb

Key tasks:

  • Separated features into categorical, binary, continuous, and numerical groups
  • Visualizations:
    • correlation_heatmap.png: Correlation heatmap for numerical features
    • feature_histograms_continuous.png: Histograms for continuous variables
    • feature_binary_counts.png: Count plots for binary variables
  • Feature selection:
    • RFE (Recursive Feature Elimination) using Logistic Regression
    • Feature importances using Random Forest (rf_feature_importance.png)
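
The two feature selection approaches listed above could look roughly like this in scikit-learn. The sketch uses synthetic data from `make_classification` in place of the engineered dataset, and the number of features to keep (4), the generic feature names, and the forest size are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered dataset (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# RFE repeatedly fits the model and drops the weakest feature.
# Scaling first keeps the logistic regression coefficients comparable.
X_scaled = StandardScaler().fit_transform(X)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X_scaled, y)
rfe_selected = [name for name, keep in zip(feature_names, rfe.support_) if keep]

# Random Forest importances: impurity-based scores that sum to 1.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)
importance_ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

print("RFE kept:", rfe_selected)
for name, score in importance_ranking:
    print(f"{name}: {score:.3f}")
```

Comparing the RFE selection with the top of the importance ranking is one way to see which features both methods agree on.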

Modeling

Performed in model_training_and_evaluation.ipynb

Pipeline steps:

  • Preprocessing:
    • Scaling numerical features
    • One-hot encoding categorical variables
  • Classifiers:
    • Logistic Regression
    • Random Forest
    • Support Vector Classifier (SVC)
  • Data split: 80/20 train/test
  • Hyperparameter tuning with GridSearchCV
  • Model selection based on test and cross-validation scores
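
A minimal sketch of how these pipeline steps fit together in scikit-learn is shown below. It is not the notebook's code: the data is randomly generated, only Logistic Regression is tuned, and the column names, the parameter grid, and the `random_state` values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Random stand-in data (assumption); labels carry no real signal.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Age": rng.integers(18, 90, n),
    "Cholesterol": rng.integers(120, 400, n),
    "Diet": rng.choice(["Healthy", "Average", "Unhealthy"], n),
})
y = rng.integers(0, 2, n)

# 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale numerical columns, one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "Cholesterol"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Diet"]),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid; "clf__C" addresses parameter C of the "clf" step.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
```

Swapping the `clf` step for `RandomForestClassifier` or `SVC`, each with its own grid, would follow the same pattern. Because the preprocessing lives inside the pipeline, it is refitted on the training portion of every cross-validation fold.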

Results and Evaluation

Accuracy (test set):

  • Logistic Regression: 0.6418
  • Random Forest: 0.6414
  • SVC: 0.6418

ROC Curve:

  • Saved as heart_attack_roc_curve_v2.png
  • Displays AUC scores and performance comparison
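
The ROC curve and AUC for a single model can be computed roughly as follows. This is a sketch on synthetic data rather than the project's dataset, and the model settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ROC analysis needs a continuous score, not hard 0/1 predictions.
scores = model.predict_proba(X_test)[:, 1]

# False positive rate and true positive rate at each threshold.
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print(f"AUC: {auc:.3f}")
```

Plotting `fpr` against `tpr` with matplotlib for each model on the same axes gives a comparison figure like the saved one. For an `SVC` fitted without `probability=True`, the output of `decision_function` can be passed to `roc_curve` in place of probabilities.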

Cross-Validation:

  • 5-fold CV for each model
  • Mean accuracy across folds printed for each model
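
A 5-fold cross-validation over the three classifier types might be sketched like this. The data is synthetic and the model settings are assumptions, so the printed numbers will not match the results reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=1)

# Scaling sits inside the pipeline so it is refitted on each training fold.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=1),
    "SVC": make_pipeline(StandardScaler(), SVC()),
}

mean_accuracy = {}
for name, model in models.items():
    # One accuracy value per fold; the mean summarizes the model.
    fold_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    mean_accuracy[name] = fold_scores.mean()
    print(f"{name}: {fold_scores.mean():.4f} (+/- {fold_scores.std():.4f})")
```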

Notes and Design Decisions

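  • Feature selection was guided by Random Forest feature importances and RFE
  • Three different classifier types (linear, tree ensemble, and kernel-based) were used for diversity
  • Preprocessing is wrapped in scikit-learn pipelines, so scaling and encoding are fitted on training data only, which helps prevent data leakage and supports reproducibility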

Reproducibility Instructions

To run this project locally:

  1. Clone this repository or download the files
  2. Make sure heart_attack_prediction_dataset.csv is in your working directory
  3. Open notebooks in this order:
    • data_preparation.ipynb
    • exploratory_analysis_and_feature_selection.ipynb
    • model_training_and_evaluation.ipynb
  4. Run each notebook cell-by-cell to reproduce results

Dependencies

Make sure the following Python libraries are installed in your environment:

  • pandas
  • matplotlib
  • seaborn
  • scikit-learn

You can install them via:

pip install pandas matplotlib seaborn scikit-learn
