WQD7001 • GROUP 1-9

Data-Driven Prediction of
Cardiovascular Disease

The Impact of Smoking & Health Factors

Leader: Jiang Yunxi
Maker: Huang Wenqi
Oracle: Jia Xinping
Detective: Suo Zihang
Secretary: Yuan Wenyu

18M+

Global Deaths/Year

Cardiovascular disease is a silent killer. Traditional diagnosis is reactive, often missing early warning signs.

Business Objectives

  • Early Screening
    Identify high-risk patients before events occur.
  • Targeted Intervention
    Focus on modifiable factors like smoking dosage.
  • Public Awareness
    Visualize risks for better patient understanding.

EDA Insight: Dosage Matters

It's not just if you smoke, but how much.

CVD risk by smoking group:
Non-Smoker: 14.3%
General Smoker: 15.8%
Heavy Smoker: 27.2%
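
The three rates above can be compared with a few lines of arithmetic. This is a minimal sketch, not the project's code: the function names `smoking_category` and `relative_risk` are hypothetical, and the bin edges are an assumption (the slide only states that heavy smoking means more than 20 cigarettes per day).

```python
# CVD rates per smoking group, copied from the EDA slide (percent).
CVD_RATE = {"non_smoker": 14.3, "general_smoker": 15.8, "heavy_smoker": 27.2}


def smoking_category(cigs_per_day: float) -> str:
    """Assumed binning: 0 = non-smoker, up to 20 = general, above 20 = heavy."""
    if cigs_per_day < 0:
        raise ValueError("cigs_per_day must be non-negative")
    if cigs_per_day == 0:
        return "non_smoker"
    if cigs_per_day <= 20:
        return "general_smoker"
    return "heavy_smoker"


def relative_risk(group: str, baseline: str = "non_smoker") -> float:
    """Ratio of a group's CVD rate to the baseline group's rate."""
    return CVD_RATE[group] / CVD_RATE[baseline]


if __name__ == "__main__":
    for group in CVD_RATE:
        print(f"{group}: {relative_risk(group):.2f}x baseline")
```

The ratio for general smokers stays close to 1, while the ratio for heavy smokers is close to 2, which is the jump the deck describes as non-linear.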

"Heavy smoking (>20 cigs/day) nearly doubles the CVD risk compared to the baseline. This non-linear relationship was key to our modeling strategy."

Model Performance Comparison

Logistic Regression

Accuracy: 67% - Failed to capture complexity.

XGBoost

Accuracy: 87% - Good, but lower Recall.

SELECTED

Random Forest

Accuracy: 91.0%
Recall (Sensitivity): 92.7%

Why? In medicine, missing a positive case (False Negative) is dangerous. Random Forest offers the highest safety margin.
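
To make the accuracy versus recall trade-off concrete, here is a small sketch that computes both metrics from confusion-matrix counts. The counts for `model_a` and `model_b` are hypothetical, chosen only for illustration; they are not the project's results.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp: int, fn: int) -> float:
    """Share of actual positive cases the model catches (sensitivity)."""
    return tp / (tp + fn)


# Hypothetical confusion-matrix counts, for illustration only.
candidates = {
    "model_a": {"tp": 90, "fn": 10, "fp": 20, "tn": 80},
    "model_b": {"tp": 80, "fn": 20, "fp": 5, "tn": 95},
}


def pick_by_recall(models: dict) -> str:
    """Select the model that misses the fewest positive cases."""
    return max(models, key=lambda name: recall(models[name]["tp"], models[name]["fn"]))


if __name__ == "__main__":
    for name, c in candidates.items():
        print(name, f"accuracy={accuracy(**c):.3f}", f"recall={recall(c['tp'], c['fn']):.3f}")
    print("selected:", pick_by_recall(candidates))
```

In this toy case `model_b` has the higher accuracy but leaves twice as many false negatives, so a recall-first rule picks `model_a`.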

Opening the "Black Box"

We used SHAP Values to explain why the model makes a prediction.

#1 Feature: EDUCATION

Surprisingly, education level is a top predictor. High education correlates with better health awareness and lifestyle quality.

#2 Feature: AGE & BP

As expected, biological aging and hypertension are primary drivers of risk.

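# Python Code Snippet
import shap

# model: the fitted Random Forest; X_test: the held-out feature DataFrame
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)  # per-class output for classifiers; layout varies by SHAP version
shap.summary_plot(shap_values, X_test)  # Revealed Education as #1 negative contributor to risk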

Plan for Reproducible Research

import numpy as np
import random

# 1. Global Random Seed for Consistency
SEED = 42
np.random.seed(SEED)
random.seed(SEED)

# 2. Strict Directory Structure
# project/
# ├── data/             # Raw & processed CSV
# ├── models/           # Saved .pkl files
# └── requirements.txt  # Package dependencies
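
The plan above can be sketched as a small setup script. This is an illustration under assumptions, not the group's actual tooling: the helper names `set_seed`, `seeded_sample`, and `scaffold` are hypothetical, and only the standard library is seeded here (a real run would also call `np.random.seed(SEED)` as shown above).

```python
import random
from pathlib import Path

SEED = 42


def set_seed(seed: int = SEED) -> None:
    """Seed Python's built-in random number generator."""
    random.seed(seed)


def seeded_sample(n: int = 3, seed: int = SEED) -> list:
    """Draw n random numbers after reseeding; identical on every call."""
    set_seed(seed)
    return [random.random() for _ in range(n)]


def scaffold(root) -> Path:
    """Create the directory layout from the plan: data/, models/, requirements.txt."""
    root = Path(root)
    (root / "data").mkdir(parents=True, exist_ok=True)
    (root / "models").mkdir(parents=True, exist_ok=True)
    (root / "requirements.txt").touch(exist_ok=True)
    return root


if __name__ == "__main__":
    project = scaffold("project")
    print("created:", sorted(p.name for p in project.iterdir()))
    print("repeatable draw:", seeded_sample() == seeded_sample())
```

Calling `scaffold` twice is safe because existing directories and files are left in place.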

Deployed Product: CVD Risk Dashboard

Age: 50 yrs
Cigs Per Day: 0
Systolic BP: 120 mmHg
Glucose: 85 mg/dL
Run Analysis

Assessment Result: Low Risk
Probability: 5.4%

Vitals Analysis
BMI: 25.0
BP (Ref: <120): 120/80
Glucose: 85

Strategic Value

Lifestyle Proxy

Education appears to act as a proxy for lifestyle, so health education may complement medical intervention.

Actionable

"Reduce smoking quantity" is a measurable goal.

Efficiency

Instant deployment enables large-scale screening.

Conclusion

From Raw Data to Life-Saving Insight.

Accuracy: 91%
Recall: 92.7%
Deployment: Ready

Thank You

Group 1-9 is open for questions.