The primary objective of this project was to develop a machine learning model capable of identifying patients at high risk for cardiovascular disease. Through a systematic evaluation of linear (SVM) and non-linear (Decision Tree) algorithms, the analysis revealed a significant challenge with overfitting.
While initial unconstrained models achieved near-perfect training scores, they failed to generalize to new data. Hyperparameter Tuning via Grid Search stabilized the model, moving it from "memorizing" the training set to identifying broader clinical patterns. This project serves as a detailed case study in balancing model complexity with real-world reliability.
Raw clinical data was transformed to ensure biological and statistical validity:
- Feature Categorization: Features were grouped into Objective (Age, Height), Examination (Blood Pressure, Cholesterol), and Subjective (Smoking, Alcohol) categories.
- Feature Engineering: Age was converted from days to years, and BMI was calculated to provide standard clinical context.
- Outlier Management: Instances where blood pressure readings were physically impossible (e.g., diastolic higher than systolic) were removed to ensure data integrity.
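A minimal preprocessing sketch of these steps is shown below, assuming a pandas DataFrame loaded from a CSV; the file name and column names (`age`, `height`, `weight`, `ap_hi`, `ap_lo`) are illustrative assumptions, not the project's confirmed schema.

```python
import pandas as pd

# Load the raw clinical data (file name is a placeholder).
df = pd.read_csv("cardio_data.csv")

# Feature engineering: convert age from days to years and derive BMI.
df["age_years"] = (df["age"] / 365.25).round(1)
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2

# Outlier management: keep only rows where diastolic pressure (ap_lo) is
# strictly lower than systolic pressure (ap_hi) and both readings are positive.
valid_bp = (df["ap_lo"] < df["ap_hi"]) & (df["ap_lo"] > 0)
df = df[valid_bp].reset_index(drop=True)
```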
We first established a performance baseline using a Linear SVM:
- Outcome: The model reached a training accuracy of 0.7232 but delivered an essentially no-skill result on the test set (AUC 0.49).
- Finding: A linear boundary was insufficient to capture the complex relationships between these health indicators.
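The baseline could be reproduced along the following lines. This is a sketch that reuses the hypothetical DataFrame from the preprocessing step and assumes the target column is named `cardio`; it is not the project's exact code.

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, roc_auc_score

# Assumed target column `cardio` (1 = cardiovascular disease present).
X = df.drop(columns=["cardio"])
y = df["cardio"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale features before fitting the linear SVM baseline.
svm = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
svm.fit(X_train, y_train)

print("Training accuracy:", accuracy_score(y_train, svm.predict(X_train)))
print("Testing accuracy:", accuracy_score(y_test, svm.predict(X_test)))
# LinearSVC has no predict_proba, so use the decision function for AUC.
print("Testing AUC:", roc_auc_score(y_test, svm.decision_function(X_test)))
```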
To capture non-linear interactions, we implemented a Decision Tree:
- The Overfitting Issue: Without constraints, the tree hit 99.4% training accuracy.
- Generalization Failure: Testing accuracy dropped to 48.4%, indicating the model was learning noise rather than transferable medical patterns.
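A short illustration of this behaviour, assuming the same hypothetical train/test split as in the baseline sketch; the exact figures will differ from the project's reported 99.4% / 48.4%.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# With no depth or leaf-size limits, the tree keeps splitting until it
# can memorize the training data almost perfectly.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

print("Training accuracy:", accuracy_score(y_train, tree.predict(X_train)))
print("Testing accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```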
We used Grid Search with 5-fold cross-validation to prune the tree and find a stable middle ground:
- Optimized Parameters: `max_depth: 5` and `min_samples_leaf: 20` (see the sketch after this list).
- Stabilization: Training accuracy was brought down to a more honest 72.3%, narrowing the gap between training and testing performance.
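A minimal sketch of the tuning step using scikit-learn's `GridSearchCV`; the candidate values in the grid, other than the reported optimum of `max_depth=5` and `min_samples_leaf=20`, are illustrative assumptions.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Candidate values are illustrative; the search reported max_depth=5 and
# min_samples_leaf=20 as the best combination.
param_grid = {
    "max_depth": [3, 5, 7, 10],
    "min_samples_leaf": [1, 5, 10, 20, 50],
}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",
    n_jobs=-1,
)
grid.fit(X_train, y_train)

best_tree = grid.best_estimator_
print("Best parameters:", grid.best_params_)
print("Training accuracy:", best_tree.score(X_train, y_train))
print("Testing accuracy:", best_tree.score(X_test, y_test))
```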
| Metric | Linear SVM | Unconstrained DT | Optimized Decision Tree |
|---|---|---|---|
| Training Accuracy | 0.7232 | 0.9943 | 0.7232 |
| Testing Accuracy | 0.4947 | 0.4844 | 0.4947 |
| Testing Recall | 1.0000 | 0.7446 | 1.0000 |
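For reference, the table's accuracy and recall metrics can be computed as follows; the sketch assumes the tuned tree (`best_tree`) and the hypothetical test split from the earlier snippets.

```python
from sklearn.metrics import accuracy_score, recall_score

# Evaluate the tuned tree on the held-out test set.
y_pred = best_tree.predict(X_test)
print("Testing Accuracy:", accuracy_score(y_test, y_pred))
print("Testing Recall:", recall_score(y_test, y_pred))  # sensitivity to high-risk patients
```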
- Generalization over Accuracy: Near-perfect training scores (99%) are often a warning sign of overfitting; achieving real clinical utility required constraining tree depth and leaf size to cope with noisy medical data.
- Prioritizing Sensitivity: The final model achieved a 1.00 Recall on the test set, ensuring that no high-risk patients are missed—a critical priority in healthcare settings.
- Future Recommendations: Move toward Ensemble Methods such as Random Forest or Gradient Boosting to further improve precision while maintaining the stability achieved through Grid Search (a rough sketch follows).
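One possible starting point for that follow-up work, again as a hedged sketch that reuses the hypothetical split and mirrors the constraints found useful for the single tree; the grid values and `recall` scoring choice are assumptions, not results from this project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the depth and leaf constraints mirror those that
# stabilized the single decision tree.
rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    {
        "n_estimators": [100, 300],
        "max_depth": [5, 8, None],
        "min_samples_leaf": [10, 20],
    },
    cv=5,
    scoring="recall",   # keep sensitivity as the primary objective
)
rf_grid.fit(X_train, y_train)

print("Best RF parameters:", rf_grid.best_params_)
print("Testing recall:", rf_grid.score(X_test, y_test))
```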
- Clone this repository.
- Install dependencies: `pip install -r requirements.txt`
- Open `Cardiovascular_Risk_Analysis.ipynb` in Jupyter Notebook.