#2 - Determine objectives and understand the data better

Formulate objectives from the data to guide the analytical process that follows


What is an Objective?

In any business problem, data analysts are often asked to find patterns and trends in the data in order to identify the root cause of a problem (why something is happening). Along the way, they maintain the quality of the data so that the true objectives can be pinned down precisely. These objectives (targets) are formulated from the data itself, not from personal opinions or prejudices.

Cover image: Photo by Silvan Arnet on Unsplash

To understand the data, one first has to be a keen observer. Just by observing the data, one can see what kind of facts are given and which factors can be considered the most important features. This stage is called Data Inspection.

Data Inspection

It is in this stage that one acts as Sherlock Holmes to understand the data better.


As soon as the data is loaded, to make sure we are on the right track, we should be curious enough to ask the following questions -

  1. What is the problem statement?
  2. What is the structure of the data?
  3. What are the variables present?
    • What is the type of each variable?
  4. What is the total size of the data?
  5. Are there any missing or NaN values within the data?

Let's try to understand a dataset by answering the above questions, keeping the objectives in mind.

Note - Throughout this series, we will be dealing with one dataset originally taken from Kaggle.

Hands-on Practice

For the hands-on practice, we will use a health-care dataset on the possibility of a heart attack.

Let's get started ...

We will be using Python, and specifically the Pandas library, which was developed with the clear intention of addressing the key aspects of data analytics and data science problems.

Installation

pip install pandas --user
pip install numpy --user

The above commands install the latest versions of both packages for the current user. Let's import them.

import pandas as pd
import numpy as np
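
Optionally, to confirm that the imports picked up the freshly installed packages, both libraries expose a __version__ attribute:

print(pd.__version__)  # optional sanity check on the installed pandas version
print(np.__version__)  # same for numpy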

1) What is the problem statement?

The title of the dataset itself gives a brief idea of the problem statement that we need to work on, i.e., predicting the possibility of a heart attack. But, for this article, our goal is to understand the data by answering the above questions.

Load / Read

In order to load or read the data, we use the read_csv() method where we explicitly tell Pandas that we want to read a CSV file.

df = pd.read_csv('heart.csv')
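
To quickly confirm the file loaded as expected, we can preview the first few rows (this assumes heart.csv sits in the current working directory):

df.head()  # shows the first five rows of the DataFrame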

2) What is the structure of the data?

The dataset we have just loaded is a CSV file, where all the values are separated by the , character. By default, the data is read as a DataFrame, which is basically tabular data.
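
Since the comma is read_csv()'s default separator, no extra argument was needed above. For a file with a different delimiter, say a semicolon, we could pass it explicitly (the file name below is purely hypothetical):

df_other = pd.read_csv('other_data.csv', sep=';')  # hypothetical semicolon-separated file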

df.info()

Since we have created a DataFrame object called df, we can call the various methods available on it. One such method is info(), which gives information about the data frame - its columns, their types, non-null counts, and memory usage.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.2 KB

No doubt, we have obtained the entire structure and all the columns with just one command. We also got the type of each column, which indirectly answers the 3rd question.

3) What are the variables present?

Variables, also called features, are the columns of the data. Each column or variable holds a specific type of data, which can be int, float, object, etc.

df.columns

The above returns an Index containing the column (variable) names. Note that the dtype='object' in the output refers to the column labels themselves, which are strings, not to the data inside the columns.

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')
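
If a plain Python list of names is more convenient to work with, the Index can be converted with tolist():

df.columns.tolist()  # plain list of column names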

3.1) What is the type of each variable?

df.dtypes

The above lists out the type of data that each variable or column holds.

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object
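
Once the dtypes are known, a handy related method is select_dtypes(), which filters columns by type; here it would isolate the lone float column:

df.select_dtypes(include='float64').head()  # keeps only 'oldpeak' in this dataset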

4) What is the total size of the data?

The size of the data can be determined by -

df.shape

This gives the total number of rows and columns in the form of a tuple.

(303, 14)

The dataset has 303 rows and 14 columns in total.
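
Two related attributes are worth knowing here: len(df) returns just the row count, while df.size gives the total number of cells:

len(df)   # 303 -> number of rows
df.size   # 4242 -> 303 rows x 14 columns

Now comes the last and final question.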

5) Are there any missing or NaN values within the data?

Missing values in the data indicate that the value for that column or variable at a particular index is Null or None. To find them, we have a method called isnull(), which can be understood in this way -

is the data frame null?

df.isnull()

This, in turn, returns a new data frame of boolean values representing

  • True, wherever the value is Null
  • False, wherever the value is NOT-Null

    age    sex    cp     trestbps chol   fbs    restecg  thalach  exang  oldpeak  slope  ca     thal   target
0   False  False  False  False    False  False  False    False    False  False    False  False  False  False
1   False  False  False  False    False  False  False    False    False  False    False  False  False  False
2   False  False  False  False    False  False  False    False    False  False    False  False  False  False
    ...    ...    ...    ...      ...    ...    ...      ...      ...    ...      ...    ...    ...    ...
301 False  False  False  False    False  False  False    False    False  False    False  False  False  False
302 False  False  False  False    False  False  False    False    False  False    False  False  False  False

However, looking at the above result, it can be tedious to pinpoint exactly where a True value (signifying a Null) is located. We can achieve this with the help of NumPy's np.where() method.

def get_nan_location(dframe):
    """Map each column of dframe to the row indices where it holds NaN."""
    nan_locs = {}
    for col in dframe.columns:
        # np.where returns the positions of the True entries in the boolean mask
        nl = list(np.where(dframe[col].isnull())[0])
        nan_locs[col] = nl
    return nan_locs

The above function takes one parameter, dframe (the actual data frame), and returns a dictionary where each key is a column name and each value is the list of indices at which that column holds a NaN.

get_nan_location(dframe=df)
#--------------------------
{'age': [],
 'sex': [],
 'cp': [],
 'trestbps': [],
 'chol': [],
 'fbs': [],
 'restecg': [],
 'thalach': [],
 'exang': [],
 'oldpeak': [],
 'slope': [],
 'ca': [],
 'thal': [],
 'target': []}

We can see that every column maps to an empty list, which simply indicates that there are no NaN values in the dataset.
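
As a side note, when only the count of missing values per column matters (and not their exact positions), Pandas offers a shorter built-in route:

df.isnull().sum()  # count of NaN values per column; all zeros for this dataset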

Once this stage is over, we can formulate further questions, which will mainly depend on the data we are working with.


Well, that's all for now. This article is part of the series Exploratory Data Analysis, where I share tips and tutorials to help you get started. We will learn step by step how to explore and analyze data. As a prerequisite, knowing the fundamentals of programming would be helpful. The link to this series can be found here.