#9 - Separating Numerical, Categorical, and Textual Data

Data separation based on its type

#9 - Separating Numerical, Categorical, and Textual Data

Introduction

For every tabular data, we know that the data is stored in the form of a matrix (rows and columns). It is obvious that every column may not be of the same data type. We tend to have data of different types. Some column values may be integers (numerical), real-valued (float), and categorical (string). The main agenda of this article is to explore some unknown methods which can effectively work for separating the columns based on their type.

Unfortunately, if in the dataset there are both categorical and textual data, Pandas will consider both as categorical data. But, with a simple hack, we can differentiate that easily.

  • If we have categorical data like ['short', 'tall', 'average'], we can still extract the information even though each value is of string type. We can plot a pie chart or bar chart to visualize it.

  • But, if we have textual data like [ 'This is the first document.', 'This document is the second document.'], we cannot do much for this although the type of each value is a string, we know that it is actually a sentence.

With Pandas, we will see how to solve this separation task programmatically.

Credits of Cover Image - Photo by Will Francis on Unsplash

import Packages

import numpy as np
import pandas as pd
import random

Scenario 1

For this example, we will create our own (dummy) data set.

real_values = np.random.rand(10, 3)
num_cate = np.array([random.choice([1, 2, 3]) for i in range(10)])
str_cate = np.array([random.choice(['short', 'tall', 'average']) for i in range(10)])
# -----
df = pd.DataFrame(data=real_values, columns=['col_1', 'col_2', 'col_3'])
df['col_4'] = num_cate
df['col_5'] = str_cate

The first five rows of df look like -

col_1col_2col_3col_4col_5
0.9025520.2123170.2086282short
0.3085030.9599030.6317843tall
0.4597890.1997420.1760002tall
0.7607870.1671790.1372872average
0.1997370.1202130.1450682average

We can check the type of each column by -

df.dtypes

'''
col_1    float64
col_2    float64
col_3    float64
col_4      int32
col_5     object
'''

We can see that col_5 is of type object which simply represents that it is different from int and float. To take it to the next level, we can have a function that can give the columns that are numerical and string separately.

def split_categories_numericals(dframe):
    cols = list(dframe.columns)
    num_cols = list(dframe._get_numeric_data().columns)
    cate_cols = list(set(cols) - set(num_cols))
    return cate_cols, num_cols
cate_cols, num_cols = split_categories_numericals(dframe=df)
print(cate_cols) # ['col_5']
print(num_cols) # ['col_1', 'col_2', 'col_3', 'col_4']

Scenario 2

The function split_categories_numericals() works fine for the above data set. Let's add one more column that has textual sentences as values.

sents = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'Hello, how are you doing?',
    'Did you go to the party yesterday?'
]
text_data = [random.choice(sents) for i in range(10)]
df['col_6'] = text_data

Now, the first five rows of the df look like -

col_1col_2col_3col_4col_5col_6
0.9025520.2123170.2086282shortThis document is the second document.
0.3085030.9599030.6317843tallIs this the first document?
0.4597890.1997420.1760002tallHello, how are you doing?
0.7607870.1671790.1372872averageDid you go to the party yesterday?
0.1997370.1202130.1450682averageHello, how are you doing?

If we check the dtypes of df, we get -

df.dtypes

'''
col_1    float64
col_2    float64
col_3    float64
col_4      int32
col_5     object
col_6     object
'''

We can see col_5 and col_6 have the same type, i.e., object. It is because they have string values. But, col_5 - we can get a pie chart, compute the likelihood, etc. Whereas col_6, since it has textual data, we need to convert it into vectors (numerical vectors) and then analyze it.

Now, the ultimate question is how can we know if a data set has textual data (sentences) as values. If we try to apply split_categories_numericals(), we get -

cate_cols, num_cols = split_categories_numericals(dframe=df)
print(cate_cols) # ['col_5', 'col_6']
print(num_cols) # ['col_1', 'col_2', 'col_3', 'col_4']

It considers both col_5 and col_6 as categorical.

Is there any programming hack to automatically differentiate categorical data and textual data?

The answer is Yes. We can very well achieve that with the following function.

def split_text_categories_numericals(dframe):
    cols = list(dframe.columns)

    num_cols = list(dframe._get_numeric_data().columns)
    cate_cols = list(set(cols) - set(num_cols))

    text_cols = []; category_cols = []
    for ccol in cate_cols:
        each_col_list = dframe[ccol].str.split(' ').to_list()
        col_val_len_arr = np.array(list(map(len, each_col_list)))

        if np.any(col_val_len_arr > 1):
            text_cols.append(ccol)
        else:
            category_cols.append(ccol)

    return num_cols, category_cols, text_cols
num_cols, cate_cols, text_cols = split_text_categories_numericals(dframe=df)
print(num_cols) # ['col_1', 'col_2', 'col_3', 'col_4']
print(cate_cols) # ['col_5']
print(text_cols) # ['col_6']

We have separated the columns with a simple hack now.

Use Cases

  • If in the data set, there are columns of different types, and if you want to do text processing, using the above function you can exactly get the columns that have textual data (sentences).

  • By extracting textual columns, you can easily process it and get the most out of it.

Challenges

  • The function split_text_categories_numericals() only works for the preprocessed and cleaned data.

  • It might not work efficiently if the function is applied directly to the original data.

End