#2 - Determine objectives and understand the data better
Formulate objectives from the data to guide the analytical process ahead
What is an Objective?
In any business problem, data analysts are often asked to find patterns and trends in the data in order to identify the root cause of a problem (why something is happening). Along the way, they maintain the quality of the data so that the objectives they arrive at are precise. These objectives (targets) are not formulated from personal opinions or prejudices.
Cover image credit - Photo by Silvan Arnet on Unsplash
To understand the data, one first has to be a great observer. By just observing the data, one can tell what kind of facts are given and which factors can be considered the most important features. This stage is called Data Inspection.
Data Inspection
It is in this stage that one acts as Sherlock Holmes to understand the data better.
As soon as the data is loaded, to make sure one is on the right track, one has to be curious enough to ask the following questions -
- What is the problem statement?
- What is the structure of the data?
- What are the variables present?
- What is the type of each variable?
- What is the total size of the data?
- Are there any missing or NaN values within the data?
Let's try to understand a dataset through the above questions, along with the objectives.
Note - Throughout this series, we will be dealing with one dataset originally taken from Kaggle.
Hands-on Practice
For the hands-on practice, we will use a dataset from the healthcare domain on heart attack possibility.
Let's get started ...
We are using Python, and specifically the Pandas library, which was developed with the clear intention of addressing the key aspects of data analytics and data science problems.
Installation
pip install pandas --user
pip install numpy --user
The above installs the current versions of the packages on the system. Let's import them.
import pandas as pd
import numpy as np
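To confirm which versions were installed, both libraries expose a version attribute; a quick, optional check:

print(pd.__version__)   # prints the installed Pandas version
print(np.__version__)   # prints the installed NumPy version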
1) What is the problem statement?
The title of the dataset itself gives a brief idea of the problem statement we need to work on, i.e., predicting the possibility of a heart attack. But for this article, our goal is to understand the data by answering the above questions.
Load / Read
In order to load or read the data, we use the read_csv() method, where we explicitly tell Pandas that we want to read a CSV file.
df = pd.read_csv('heart.csv')
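To quickly verify that the file loaded correctly, we can peek at the first few rows; head() returns the first 5 rows by default:

df.head()      # first 5 rows of the data frame
df.head(10)    # or any number of rows we ask for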
2) What is the structure of the data?
The dataset that we have just loaded is a CSV file, where all the values are separated by the character ','. By default, the data is read as a DataFrame, which is basically tabular data.
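The comma is Pandas' default separator, so we did not have to spell it out; for a differently delimited file, one would pass sep explicitly. A hypothetical example (data.tsv is not part of this dataset, it is shown only to illustrate the parameter):

# Hypothetical tab-separated file, illustrating the sep parameter
df_tabs = pd.read_csv('data.tsv', sep='\t')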
df.info()
Since we have created an object called df, we can call the different methods available on it. One such method is info(), which gives the information related to the data frame.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       303 non-null    int64
 1   sex       303 non-null    int64
 2   cp        303 non-null    int64
 3   trestbps  303 non-null    int64
 4   chol      303 non-null    int64
 5   fbs       303 non-null    int64
 6   restecg   303 non-null    int64
 7   thalach   303 non-null    int64
 8   exang     303 non-null    int64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64
 11  ca        303 non-null    int64
 12  thal      303 non-null    int64
 13  target    303 non-null    int64
dtypes: float64(1), int64(13)
memory usage: 33.2 KB
No doubt we have got the entire information (structure, columns, non-null counts) with just one command. We also got the type of each column, which indirectly answers the 3rd question as well.
3) What are the variables present?
Variables are also called features; they are the columns of the data. Each column or variable holds a specific type of data, which can be int, float, object, etc.
df.columns
The above gives the list of columns or variables. The dtype='object' at the end refers to the column names themselves, which are stored as strings (object type).
Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
dtype='object')
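If a plain Python list is handier than the Index object, the column names can be converted directly; a small optional sketch:

df.columns.tolist()
#--------------------------
['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']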
3.1) What is the type of each variable?
df.dtypes
The above lists out the type of data that each variable or column holds.
age int64
sex int64
cp int64
trestbps int64
chol int64
fbs int64
restecg int64
thalach int64
exang int64
oldpeak float64
slope int64
ca int64
thal int64
target int64
dtype: object
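If we later need to pick out variables by their type, say only the float columns, Pandas provides the select_dtypes() method; a minimal sketch:

# Select columns by dtype; in this dataset only 'oldpeak' is float64
df.select_dtypes(include='float64').columns   # Index(['oldpeak'], dtype='object')
df.select_dtypes(include='int64').columns     # the remaining 13 columns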
4) What is the total size of the data?
The size of the data can be determined by -
df.shape
This gives the total number of rows and columns in the form of a tuple.
(303, 14)
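Since shape is a plain tuple, it can be unpacked directly into named values; a small optional sketch:

n_rows, n_cols = df.shape
print(n_rows, n_cols)   # 303 14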
The dataset has a total of 303 rows and 14 columns. Now comes the final question.
5) Are there any missing or NaN values within the data?
Missing values in the data indicate that the value of a column or variable at a particular index is Null or None. In order to find them, we have a method called isnull(), which can be understood as asking - is the data frame null?
df.isnull()
This, in return, creates a new data frame of boolean values representing
- True, wherever the value is Null
- False, wherever the value is NOT Null
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
0 False False False False False False False False False False False False False False
1 False False False False False False False False False False False False False False
2 False False False False False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ...
301 False False False False False False False False False False False False False False
302 False False False False False False False False False False False False False False
However, looking at the above result, it can be tedious to pinpoint exactly where a True value (signifying a Null) is located. We can achieve this with the help of NumPy's np.where() method.
def get_nan_location(dframe):
    # Map each column to the list of row indices that hold NaN values
    nan_locs = {}
    for col in dframe.columns:
        # np.where() returns a tuple of arrays; [0] picks the row indices
        nl = list(np.where(dframe[col].isnull())[0])
        nan_locs[col] = nl
    return nan_locs
The above takes one parameter, i.e., dframe (the actual data frame), and returns a dictionary where each key is a column and each value is the list of indices where there is a NaN in that respective column.
get_nan_location(dframe=df)
#--------------------------
{'age': [],
'sex': [],
'cp': [],
'trestbps': [],
'chol': [],
'fbs': [],
'restecg': [],
'thalach': [],
'exang': [],
'oldpeak': [],
'slope': [],
'ca': [],
'thal': [],
'target': []}
We can see that all the columns have an empty list, which simply indicates that there are no NaN values.
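As a sanity check on the function, we can inject a missing value into a copy of the data and watch it get reported; the column and row index below are picked arbitrarily for illustration. Note that Pandas also offers isnull().sum() as a quicker built-in way to count missing values per column:

# Inject one NaN into a copy of the data (hypothetical demonstration)
df2 = df.copy()
df2.loc[5, 'chol'] = np.nan

get_nan_location(dframe=df2)   # 'chol' now maps to [5]; every other column maps to []

# Built-in alternative: count missing values per column
df2.isnull().sum()             # 'chol' shows 1, all other columns show 0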
Once this stage is over, we should formulate further questions, which mainly depend on the data we are working with.
Well, that's all for now. This article is part of the series Exploratory Data Analysis, where I share tips and tutorials that are helpful for getting started. We will learn, step by step, how to explore and analyze data. As a precondition, knowing the fundamentals of programming would be helpful. The link to this series can be found here.