#7 - Handle Missing Values in Data

Strategies to handle missing values

What is a missing value?

A missing value refers to a situation where no data is stored for a variable in an observation. This usually happens when no response is recorded for a particular question during data collection. Data analysts have to be very careful when dealing with missing values. The process of handling missing data is called imputation. On the whole, imputation is helpful to some extent, but the filled-in value is only an estimate and cannot fully replace the actual value that would have been there.

Techniques of Imputation

  1. Filling by a Central Tendency
    • Mean replacement
    • Median replacement
    • Mode replacement
  2. Categorical Imputation
    • For each category, compute the central tendency of that category and fill its missing values with it.
  3. Model Imputation
    • Build a machine learning model that predicts the missing values and fill them in with the predictions.

Note - In this article, we will focus only on techniques 1 and 2; we will not cover technique 3 for now.

To implement the above methods, we first need to create a toy dataset so that we can understand each technique programmatically.

Necessary imports

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import random

from sklearn.datasets import make_blobs
from scipy import stats
from matplotlib import pyplot as plt

Dataset Creation

In our data, we will make sure that we have at least one categorical feature. The make_blobs() method is really handy for creating toy datasets for our own experiments.

With this method, we will create a dataset that will have -

  • 100 samples
  • 2 features
  • 3 centers
X, y = make_blobs(n_samples=100, n_features=2, centers=3)
df = pd.DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
print(df.head())

The gist of the data looks like this -

          x         y  label
0 -1.960223 -2.240400      2
1 -3.489893 -3.149356      2
2 -8.173217  6.186763      1
3 -1.216968 -3.318524      2
4 -7.661975  9.092514      1

Similarly, the original plot of the data looks like this -

[Figure: scatter plot of the original data (orig_data.png), with the three clusters clearly separated]
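
For reference, here is a minimal plotting sketch (using the df created above) that could produce a figure like this:

# color each point by its cluster label
fig, ax = plt.subplots()
ax.scatter(df['x'], df['y'], c=df['label'])
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Original data')
plt.show()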

Missing values at random locations

In Python, a missing value is represented as NaN, which stands for Not a Number. We will create missing values in the y column of the above toy dataset.

# set a random 10% of the y values to NaN
df.loc[df['y'].sample(frac=0.1).index, 'y'] = np.nan

The above code replaces a randomly chosen 10% of the values in the y column of the existing data frame with NaN.

df.isnull().sum()

# ------
'''
x         0
y        10
label     0
dtype: int64
'''

Computing Central Tendency

Since we have to impute the missing values with central tendencies like the mean, median, and mode, we will create a helper function that takes three params -

  • data → The dataset for which the central tendencies should be computed.
  • col → The column for which the central tendencies should be computed.
  • strategy → Mean, Median, and Mode strategies.
def fill_values(data, col, strategy):
    # compute the fill value for the given column based on the strategy
    if strategy == 'mean':
        fval = data[col].mean()
    elif strategy == 'median':
        fval = data[col].median()
    else:
        # pandas' mode() skips NaN values and returns a Series
        fval = data[col].mode().iloc[0]
    return fval

The above function returns a value based on the strategy that is used.
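
For example, a quick check of the helper on our data frame (the exact numbers depend on the randomly generated blobs):

# pandas' mean() and median() skip NaN values by default
print(fill_values(df, 'y', 'mean'))
print(fill_values(df, 'y', 'median'))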

1) Filling by a Central Tendency

In order to fill the NaN values with new values in the existing column of the data, we need to have a function that does so. The function takes three params -

  • data → The dataset for which the central tendencies should be computed.
  • col → The column for which the central tendencies should be computed.
  • strategy → Mean, Median, and Mode strategies.
def simple_imputer(data, col, strategy):
    # compute the fill value and replace every NaN in the column with it
    fval = fill_values(data=data, col=col, strategy=strategy)
    non_nans = data[col].fillna(fval).to_list()
    return non_nans

The above function returns a list with filled values based on the strategy that is used.
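
As a quick usage sketch (on a copy of the df created earlier, so the original stays intact):

df_simple = df.copy()
df_simple['y'] = simple_imputer(df, 'y', 'median')
print(df_simple.isnull().sum())   # y should now show 0 missing values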

Note - The above technique is very simple and generally produces worse results than the other techniques.

If we visualize the data after filling the NaNs, we see some dispersion in the data.

[Figure: data after simple imputation (Inkedsimple_impute_LI.jpg); the filled points form a horizontal band across the plot]

We can see a horizontal pattern formed by some of the points. These are the NaN values that were filled with the respective central tendency, and none of them belongs to any of the clusters.

2) Categorical Imputation

In our original dataset, besides the x and y columns, we also have a column called label that has 3 unique values: [0, 1, 2]. A simple hack we can use here is to separate the data by clusters and then compute the central tendency within each cluster to fill its NaN values. By doing so, we can make sure that the new values belong to their respective clusters.

As usual, we will have a function to separate data cluster-wise. The function takes four params -

  • data → The dataset for which the central tendencies should be computed.
  • col → The column for which the central tendencies should be computed.
  • label → The column with which the data is separated into clusters.
  • strategy → Mean, Median, and Mode strategies.
def label_imputer(data, col, label, strategy):
    col_label_df = data[[col, label]]
    counts_df = col_label_df[label].value_counts().to_frame()
    classes = counts_df.index.to_list()

    fval_class = []
    for each_class in classes:
        # separating the data into clusters
        mini_frame = col_label_df[col_label_df[label] == each_class]
        # filling this cluster's NaNs with its own central tendency
        non_nans_frame = pd.Series(data=simple_imputer(mini_frame, col, strategy), index=mini_frame.index)
        fval_class.append(non_nans_frame)
    final_vals = pd.concat(fval_class).sort_index().to_list()

    return final_vals

The above function computes the central tendency for each cluster, fills the missing values cluster by cluster, and finally returns a single list of values in the original row order.
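
A quick usage sketch on the same data frame, filling y cluster by cluster:

df_label = df.copy()
df_label['y'] = label_imputer(df, 'y', 'label', 'median')
print(df_label.isnull().sum())   # y should now show 0 missing values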

If we visualize the data after filling the NaNs, we no longer see any dispersion; instead, the plot is uniform and resembles the original plot (without NaN values).

[Figure: data after categorical imputation (label_imputer.png); the filled points blend into their respective clusters]


Full Code

class Imputer:
    def __init__(self, strategy):
        # fall back to 'median' if an unknown strategy is passed
        self.available_strats = ['mean', 'median', 'mode']
        self.strategy = 'median' if strategy not in self.available_strats else strategy

    def fill_values(self, data, col):
        # compute the fill value based on the chosen strategy
        if self.strategy == 'mean':
            fval = data[col].mean()
        elif self.strategy == 'median':
            fval = data[col].median()
        else:
            # pandas' mode() skips NaN values and returns a Series
            fval = data[col].mode().iloc[0]
        return fval

    def simple_imputer(self, data, col):
        fval = self.fill_values(data=data, col=col)
        non_nans = data[col].fillna(fval).to_list()
        return non_nans

    def label_imputer(self, data, col, label):
        col_label_df = data[[col, label]]
        counts_df = col_label_df[label].value_counts().to_frame()
        classes = counts_df.index.to_list()

        fval_class = []
        for each_class in classes:
            mini_frame = col_label_df[col_label_df[label] == each_class]
            non_nans_frame = pd.Series(data=self.simple_imputer(data=mini_frame, col=col), index=mini_frame.index)
            fval_class.append(non_nans_frame)
        final_vals = pd.concat(fval_class).sort_index().to_list()

        return final_vals
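
A short usage sketch of the class on the same toy data frame (again using a copy):

imputer = Imputer(strategy='median')
df_class = df.copy()
df_class['y'] = imputer.label_imputer(data=df, col='y', label='label')
print(df_class.isnull().sum())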

Conclusion

  • Throughout this article, we saw different techniques for handling missing data. However, these were implemented on a toy dataset, not on a real dataset.

  • These techniques may or may not be suitable for real data. Moreover, a real dataset will not always have a categorical feature with which to separate the data into clusters and fill the values cluster by cluster.

  • It might be beneficial to practice the third technique, which predicts the missing values with a mathematical model; its quality depends entirely on the training data and the model itself. A minimal sketch follows this list.

  • It is ultimately up to the practitioner to choose a method and come up with a solution for the given problem.
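
As a pointer for the third technique, here is a minimal sketch using scikit-learn's KNNImputer, which fills each missing value from the nearest rows in feature space. This is just one of many possible model-based imputers, not a definitive recipe:

from sklearn.impute import KNNImputer

# each NaN is replaced using the 5 nearest rows in feature space
knn = KNNImputer(n_neighbors=5)
df_knn = df.copy()
df_knn[['x', 'y']] = knn.fit_transform(df_knn[['x', 'y']])
print(df_knn.isnull().sum())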


Buy Me Coffee

If you liked my article, you can buy me a coffee and support me here. That would motivate me to keep writing and learning.

Well, that's all for now. This article is part of the series Exploratory Data Analysis, where I share tips and tutorials that are helpful for getting started. We will learn step by step how to explore and analyze data. As a prerequisite, knowing the fundamentals of programming would be helpful. The link to the series can be found here.