The data science lifecycle is a framework for structuring big data problems and data science projects. A data science project generally consists of seven steps: problem definition, data collection, data preparation, data exploration, data modeling, model evaluation, and model deployment. This article walks through the data science lifecycle to build a web application for heart disease classification.
If you are interested in a specific step of the lifecycle, you can read it on its own without going through the other steps in depth.
Problem Definition
Clinical decisions are often based on doctors’ experience and intuition rather than on the rich knowledge hidden in the data. This leads to errors and extra costs that affect the quality of medical services. Analytic tools and data modeling can help improve clinical decisions. Thus, the goal here is to build a web application that helps doctors diagnose heart disease. The full code is available in my GitHub repository.
Data Collection
I collected the heart disease dataset from the UCI Machine Learning Repository. The dataset has the following 14 attributes:
- age: age in years.
- sex: sex (1=male; 0=female).
- cp: chest pain type (0 = typical angina; 1 = atypical angina; 2 = non-anginal pain; 3 = asymptomatic).
- trestbps: resting blood pressure in mm Hg on admission to the hospital.
- chol: serum cholesterol in mg/dl.
- fbs: fasting blood sugar > 120 mg/dl (1=true; 0=false).
- restecg: resting electrocardiographic results (0=normal; 1=having ST-T wave abnormality; 2=probable or definite left ventricular hypertrophy).
- thalach: maximum heart rate achieved.
- exang: exercise-induced angina (1=yes; 0=no).
- oldpeak: ST depression induced by exercise relative to rest.
- slope: the slope of the peak exercise ST segment (0=upsloping; 1=flat; 2=downsloping).
- ca: number of major vessels (0–3) colored by fluoroscopy.
- thal: thalassemia (3=normal; 6=fixed defect; 7=reversible defect).
- target: heart disease (1=no, 2=yes).
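Since most of these attributes are stored as integer codes, a small lookup table can make individual records easier to read. The following is a minimal sketch, not the code from the repository: the `LABELS` dictionary and the `decode_record` helper are hypothetical names, and the mappings are copied from the attribute list above.

```python
# Code-to-label mappings, copied from the attribute list above.
LABELS = {
    "sex": {0: "female", 1: "male"},
    "cp": {
        0: "typical angina",
        1: "atypical angina",
        2: "non-anginal pain",
        3: "asymptomatic",
    },
    "fbs": {0: "false", 1: "true"},  # fasting blood sugar > 120 mg/dl
    "restecg": {
        0: "normal",
        1: "ST-T wave abnormality",
        2: "probable or definite left ventricular hypertrophy",
    },
    "exang": {0: "no", 1: "yes"},
    "slope": {0: "upsloping", 1: "flat", 2: "downsloping"},
    "thal": {3: "normal", 6: "fixed defect", 7: "reversible defect"},
    "target": {1: "no", 2: "yes"},
}


def decode_record(record):
    """Return a copy of `record` with coded values replaced by their labels."""
    decoded = {}
    for name, value in record.items():
        mapping = LABELS.get(name)
        # Numeric attributes (age, chol, ...) have no mapping and pass through;
        # unknown codes are also left unchanged rather than raising an error.
        decoded[name] = mapping.get(value, value) if mapping else value
    return decoded


# Hypothetical patient record; the values are invented for illustration only.
example = {"age": 54, "sex": 1, "cp": 2, "chol": 240, "target": 1}
print(decode_record(example))
```

Leaving unknown codes untouched is a deliberate choice here, so that unexpected values in the raw file stay visible instead of being silently dropped.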
Data Preparation and Exploration
Here are the top 5 rows of the dataset:
Python Code:
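A minimal sketch of how the first rows might be loaded and displayed, assuming pandas is installed and the data is stored as a comma-separated file without a header row. The file name `heart.csv`, the `load_heart_data` helper, and the column order are assumptions for illustration, not taken from the original code.

```python
import pandas as pd

# Column names in the order of the attribute list above
# (assumed to match the column order in the file).
COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target",
]


def load_heart_data(source, sep=","):
    """Read a headerless heart disease file into a DataFrame with named columns.

    `source` can be a file path or any file-like object.
    """
    return pd.read_csv(source, header=None, names=COLUMNS, sep=sep)


# "heart.csv" is a hypothetical file name; uncomment once the file is in place.
# df = load_heart_data("heart.csv")
# print(df.head())  # first 5 rows
```

If the file already contains a header row, drop `header=None` and `names=COLUMNS` so that pandas reads the names from the file instead.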
The header of the heart disease dataset